CHAPTER I

STATISTICS AS A SCIENCE: AXIOMS OF 
PROBABILITY 

1. Introductory. The word “statistics” is defined in
the Concise Oxford Dictionary as follows: in the plural,
“numerical facts systematically collected, as statistics of
population, crime”; in the singular, “science of collecting,
classifying and using statistics.” This definition adequately
conveys the present meaning of the word; but the term
was once restricted, as its derivation shows, to systematic
collections of data descriptive of political communities, a
domain partly taken over now by the more special word
“demography.”

The word statistics (in the plural) is used nowadays
to characterize “numerical facts systematically collected”
in any field whatever of observation or experiment.
The technique of collecting data and the principles
to be heeded in order to avoid bias in the interpretation
are described at length and exemplified in chapters of more
extensive treatises which the reader may consult. He may
also form a general idea of practical details by studying
the prefatory description of method in some actual published
investigation, for example into housing and economic
conditions in a particular town or area. In any case the
principles to be observed in arranging a statistical
investigation can be thoroughly grasped only when the analysis
used to interpret the data is well understood; and this
involves a knowledge of the science of statistics (in the
singular).

The intermediate stage of tabulation, by which collected
data are set out in the most perspicuous form for analysis
or inspection with a particular aim, is also usually the
subject of a chapter, with illustrative examples and
criticisms, in larger treatises than the present one. Here
again the reader may learn much from the attentive
perusal of statistical year-books and similar publications,
and from the results tabulated in other published
investigations. The principles are those of logical classification of
different categories; and the art of tabulation rests in
making the relation of the categories and the numbers
in various categories as clear as possible to the eye yet
compact on the printed page. Thus one may have
statistics of employed persons according to age, sex,
district, trade and wage; how can the respective numbers
best be set out in one or more tables with rows and columns,
row-totals, column-totals, sub-totals and grand totals?
This is a typical problem of tabulation, and the chief aids
towards resolving it rest on experience and common sense.

Statistics involves classification by number in categories. 
Let us note for further reference the possible relations of 
individuals in two categories A and B. It may be that an 
individual of the collection cannot be both A and B at the 
same time; for example if a coin falls “heads,” it certainly
has not fallen “tails.” The categories A and B are then
mutually exclusive; their relation is that of “either ...
or.” On the other hand, the categories A and B may be
of such a kind that an individual may belong to both at
the same time; the relation of such categories is that of
“both ... and.”

2. Statistics as a Science. The concern of the
present book will for the most part be with statistics (in
the singular) as a science. The typical order of
development of the “exact” sciences (as they are somewhat
loosely called) has been along the following lines. First
of all, the examination of data collected in a particular
field of inquiry is found to disclose elements of regularity,
suggesting a law or laws. This is the stage of inductive
synthesis. These laws are expressed, if possible, in the
form of logical or numerical axioms, resembling those 
of Euclidean geometry. The methods of logic and 
mathematics are then brought into play to develop the 
consequences of the axioms, producing an assemblage of 
theorems or propositions. This department of the science, 
namely the posing of axioms and the deduction of theorems, 
is usually called the pure branch of the science. Even if 
future observations should invalidate the axioms
extrinsically, the discrepancies between theory and fact being
too great to be explained away, these axioms and the 
deductions based on them would still have an abstract 
validity, as a logical structure of propositions exempt 
from self-contradiction; but for the description and 
explanation of the phenomena a new set of axioms would 
have to be found. On the other side, the corroborative 
part of the science consists in interpreting the abstract 
functions, formulae, equations, constants, invariants and 
the like, which occur in the pure formulation, as measures 
and measurable relations of actual phenomena, or numbers 
constructed from those measures in a definite way. This 
interpretative discipline constitutes the applied branch of 
the science. 

Such a division or dichotomy into pure and applied can 
be recognized in almost any science. A good example is 
Newtonian dynamics, according to which the motions of all 
bodies in the universe were presumed to obey certain axioms 
and postulates, namely Newton’s laws of force and motion 
and the law of gravitation. Later experiments, more
numerous, more delicate, more comprehensive, suggested 
that this formulation, though describing almost all observed 
dynamical phenomena with a precision unprecedented in 
history, did not sufficiently account for certain exceptional 
facts, such as the precession of the perihelion of Mercury. 
The discrepancies between prediction and actuality were 
extraordinarily small, but they were persistent. There thus 
arose a theory, or rather a succession of supplementary theories,
of relativity, formulated on a new axiomatic basis by which 
the discrepancies of the earlier one might be reconciled, or 
removed. This reformulation of hypotheses still proceeds, is 
still incomplete, and undergoes modification from time to time. 

What is the axiomatic basis of the science of statistics, 
and what are the facts upon which the inductive synthesis 
is based ? The facts are certain regularities which have 
been observed in the proportionate frequency with which 
certain simple events happen or do not happen, when the 
circumstances under which they may occur are reconstructed 
again and again in repeated trials ; and the axioms, and 
the structure of theorems founded upon them, constitute 
the subject called mathematical probability. As for the
facts, anyone who is interested can collect a few for himself.
Spin an ordinary coin a large number of times,
and one can hardly fail to notice that the proportions 
of heads and of tails are very nearly equal; or shake 
a well-made die repeatedly from a dice-box and one will 
find that after many trials each face of the die has turned 
up in about one-sixth of the total number of trials. 

Example. The reader is recommended to experiment 
with simple repeated trials of this kind, and for future 
reference to record the results in sequence, in the order in 
which they occur. For example, the record of spins of a 
coin might be 

00101 01110 01101 00001 10111...

or the like, where “1” denotes “heads,” and “0” “tails.”
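For a reader without a coin to hand, the experiment can be imitated with a pseudo-random generator. The following is only a sketch under stated assumptions: the generator of Python's `random` module stands in for the physical coin and die, and the names `relative_frequency`, `spin_coin`, `throw_die` and `record` are introduced here for illustration and do not occur in the text.

```python
import random

def relative_frequency(trial, event, n, rng):
    """Return m/n, the proportion of n trials in which the event occurred."""
    m = sum(1 for _ in range(n) if event(trial(rng)))
    return m / n

def spin_coin(rng):
    return rng.randint(0, 1)      # 1 denotes heads, 0 denotes tails

def throw_die(rng):
    return rng.randint(1, 6)      # faces 1 to 6, assumed equally likely

def record(spins, group=5):
    """Write the results in sequence, in groups of five as in the text."""
    digits = "".join(str(s) for s in spins)
    return " ".join(digits[i:i + group] for i in range(0, len(digits), group))

rng = random.Random(1)            # fixed seed, so that a run can be repeated

print(record([spin_coin(rng) for _ in range(25)]))
for n in (10, 100, 1000, 100000):
    heads = relative_frequency(spin_coin, lambda x: x == 1, n, rng)
    aces = relative_frequency(throw_die, lambda x: x == 1, n, rng)
    print(n, round(heads, 4), round(aces, 4))
```

With small n the proportions fluctuate noticeably from run to run; with large n they should settle near 1/2 and 1/6 respectively, which is the regularity the text describes.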

It is instinctive to look for some cause for this
approximate equality of frequency in heads and tails,
and natural to locate this cause as somehow resident in 
the two-sided nature and appreciable symmetry of the 
coin ; or to ascribe the approximate equality of frequency 
of the faces of the die to its six-sided and nearly uniform 
configuration. Simple ideas such as these suggest by 
generalization and abstraction the axioms of probability; 
but the choice of axioms may be made in various ways,
which lead to different formulations of the theory of 
probability. 

3. Survey of Various Definitions of Probability. 

No single particular definition of probability has so far 
met with predominating acceptance. The requisites of a 
satisfactory basis would be these : breadth of application, 
sufficient closeness to the intuitions in which the concept 
originates, and freedom from excessive complexity or 
abstruseness. No theory as yet proposed has been able 
to make these requisites compatible. We may survey 
some contrasting standpoints. 

Probability as the Logic of Uncertain Inference.
One view is that probability may be regarded as a kind
of extension of classical logic, an extension conveniently
described as the “logic of uncertain inference.” This
view has been expounded by J. M. Keynes in A Treatise
on Probability (London, 1921), especially in Part II,
Chapters X-XVII, where references to earlier expositions
are given. Probability is here regarded as “the degree
of our rational belief” in the truth of a given proposition,
such belief being contingent on a body of relevant
knowledge. A logical algebra is developed, but the theorems
are stated in symbolic, not in numerical or metrical terms,
and can be applied to the objective problems of statistics
only by an abrupt and dubious transition from the symbolic
to the metrical.

Probability a Priori, and Probability as Relative
Frequency. As our simple illustrations of the coin and
the die have suggested, the crude intuition of probability
rests on the observation that when a given set of
circumstances S, such as a symmetrical coin spun rapidly, has
been present on numerous occasions in the past, it has
been associated in a nearly constant proportion of those
occasions with some event E, such as the fall of “heads.”

The apriorist theory directs attention to the set of
circumstances S, or rather to the invariant part of S. In
many spins of a coin or die something remains unchanged,
namely those properties which describe the coin or die
as a rigid constant configuration. The apriorist will
regard the probabilities of falls 1, 2, 3, 4, 5, 6 of a die
as some part of the description of the die, as measuring
indeed some quality resident in the structure of the die,
before any spinning is performed. Now the classical
a priori definition took account only of a very limited
class of “systems” S, namely those possessing symmetry,
in the sense that the different aspects (such as faces 1, 2,
3, 4, 5, 6 of the die) were presumed physically
indistinguishable. Such an assumption is an idealization of the
facts, for we can never hope to test completely the
symmetry of any actual coin or die; not only would
the tests be infinitely many and impossibly delicate, but
the concept of the rigidity and permanence in time of a
material body is not sustained by modern physics.
However, symmetry being presumed, the six faces 1, 2, 3, 4,
5, 6 were characterized as “equally likely” to be found
uppermost after any throw, and the probability of 1/6
was attributed to each of these “events.” More generally,
if n equally likely aspects of a proposed system S were
discriminated, m of these being favourable to the event E,
the probability of E with respect to S was defined as
p(E; S) = m/n.

Criticism is easy. The logician will not fail to pounce 
upon the words “equally likely,” pointing out that they are
synonymous with “equally probable,” and that therefore
probability is being defined by what is probable, a circulus
in definiendo being thus committed. Postponing the
defence, we may pass on to inquire what could be the
definition of probability, should the tests have disclosed
asymmetry in S. The inquiry is most pertinent, for the
heterogeneous and the asymmetrical are the prevalent 
order of nature, the homogeneous and the symmetrical 
being the exception. One has no difficulty for example in 
conceiving a die which might be an irregular hexahedron,
heterogeneous in density and with non-parallel and unequal 
opposite edges and faces. Such dice, and more complicated 
asymmetrical systems, have been subjected to repeated 
trials, which have shown a tendency of relative frequency 
of falls towards a constancy resembling that observed in 
symmetrical systems. 

Stability of Relative Frequency. Another view
from the angle of “common sense,” in some respects
antithetical to the view just mentioned, is the frequency
view. Here the invariability of the configurative part
of S, whether symmetrical or unsymmetrical, is tacitly
assumed, and attention is concentrated upon the sequence
of trials, and the incidence of E in these. For example,
the die is thrown again and again. When E occurs, let
us write 1; when E does not occur, let us write 0. A
succession of n trials then gives a sequence

$A = a_1 a_2 a_3 a_4 \dots a_n \dots$   (1)

each $a_j$ being 1 or 0.

Let m be the number of 1's in this sequence. A very
limited experience, such as spinning a coin or die 10 times 
on several occasions, will show that in a finite number n 
of trials made upon the same system S on two or more 
occasions, different values of m are not only possible but 
usual. Thus, if E is the throw of an ace with a single 
die, 100 throws may on one occasion give m = 15 and 
on another occasion give m = 20. It follows that in order 
to define a probability p(E ; S) which shall be unique and 
not discordant with experience, we must idealize once 
again, postulating a limiting process as n tends to infinity 
and writing 

$\lim_{n \to \infty} m/n = p(E; S)$.   (2)

This is in fact a definition, supported by a certain school
of statisticians, based upon the limit of frequency ratio
or relative frequency m/n. Though at first sight attractive,
it fades a little on scrutiny. Granted the postulate of this
limit p for one sequence of trials upon S, can we accept
the more stringent postulate that the same limiting value
p is obtained for any other infinite sequence of trials on
S? Not without further assumptions, for one might
imagine a mechanism sufficiently delicate to throw heads
with a coin, or an ace with a die, on almost all occasions.
There is therefore some restriction on the manner of
throwing, or on the initial state of S. This restriction
is usually stated in the form of a condition that successive
throws must be “random,” but this merely transfers the
burden of explanation to a new and undefined concept,
“randomness.” To discuss various attempts to define
randomness would take us too far afield. It is easy to
say that randomness is absence of any law; but what is
“law” in this connexion?

Another difficulty is that the tendency of relative
frequency m/n towards a limit p is different in nature
from the corresponding tendency to a limit which
mathematicians have discerned and used in the infinite sequences
of mathematical analysis. To take a classical example,
in the sequence defining a certain simple geometric series,

$1,\ 1 - \tfrac{1}{2},\ 1 - \tfrac{1}{2} + \tfrac{1}{4},\ 1 - \tfrac{1}{2} + \tfrac{1}{4} - \tfrac{1}{8},\ \dots$   (3)

the deviations of the successive terms from 2/3 are respectively
1/3, -1/6, 1/12, -1/24, ..., each being numerically half its
predecessor, so that, given a small number $\epsilon$, such as 1/1000000,
we can always find some term sufficiently far along the
sequence, after and including which all terms deviate from
2/3 by less than $\epsilon$. Thus 2/3 is the limit of this sequence.
But what can be asserted concerning the sign and magnitude
of the deviation $\epsilon_n$, considered as a function of n, in

$\epsilon_n = m/n - p(E; S)$?

It would seem that the only kind of assertion about $\epsilon_n$
which would carry conviction would itself involve somewhere
the notion of probability; and here the risk of
committing a circle in definition again raises its head.
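The behaviour of the geometric sequence (3) can be checked exactly, and the check makes the contrast with relative frequency plainer. The sketch below uses exact rational arithmetic; the helper names `partial_sums` and `first_index_within` are assumptions made for this illustration only.

```python
from fractions import Fraction

def partial_sums(count):
    """Partial sums 1, 1 - 1/2, 1 - 1/2 + 1/4, ... as exact fractions."""
    total, term, sums = Fraction(0), Fraction(1), []
    for _ in range(count):
        total += term
        sums.append(total)
        term *= Fraction(-1, 2)   # each new term is minus half the last
    return sums

limit = Fraction(2, 3)
deviations = [s - limit for s in partial_sums(25)]

def first_index_within(eps):
    """Position (counting from 1) of the first term deviating by less than eps.
    Every later term deviates still less, each deviation being half the last."""
    return next(i for i, d in enumerate(deviations, start=1) if abs(d) < eps)

print(deviations[:4])
print(first_index_within(Fraction(1, 1000000)))
```

Here the position beyond which all deviations fall below a given $\epsilon$ can be computed outright. No comparable computation is available for $\epsilon_n = m/n - p$, where a long run of heads at any stage cannot be excluded by the definition alone.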

It should be added that the chief defects of the approach 
to probability by limit of frequency ratio have lately been 
removed by the work of de Mises, Copeland, Dörge, Wald
and others. These writers admit only certain sequences 
A of suitable postulated properties, including that of 
limiting ratio ; but some logical difficulties remain, and 
the modified formulations lose the primitive simplicity in 
which they originated. 

It would seem, however, that a more natural course, 
and one more in line with the general method of science, 
would be to try to explain the effect, namely the relative 
frequency of E, by an analysis of the cause, namely the 
system S. This suggests a return to the a priori
standpoint; and it may be noted that several authors at the
present time, Fréchet, Kolmogoroff, Cramér and others,
have been independently engaged in rehabilitating the 
a priori definition by furnishing it with a better axiomatic 
basis. 

4. Probability as Measure of a Sub-Aggregate. 

Let us examine more closely the system S, keeping some 
simple system such as a coin or die in mind. The 
approximately constant element in our sequences A,
namely the almost stable frequency ratio of E, must
reflect (at least so our intuition suggests) the constant
element of S, such as the rigid configuration of a coin
or die; the irregularity which we name randomness
doubtless reflects the variable part of S, such as the
initial position, velocity and angular velocity of projection.

What is S when an unsymmetrical and heterogeneous 
die is spun and falls ? It consists of (i) the die, specified 
as a particular constant rigid body, (ii) the floor or table 
on which it may impinge or finally rest, (iii) the surrounding 
air, and so on ; together with (iv) the circumstances of 
projection, described by coordinates of initial position, 
momentum and angular momentum. The coordinates
specifying the rigidity of the die and the configuration
of the table or floor are constant components of S, the
other initial coordinates of S are variable. The set of
coordinates of S at the instant of projection may be
called the initial phase. Each variable coordinate, such
as the initial position, or the initial momentum, has a
certain field of variation. Hence we must assume a set
of possible phases which, if they can be enumerated in
some order, may be designated by $S_1, S_2, \dots, S_i, \dots$;
and this ensemble of possible initial phases $S_i$ constitutes
an aggregate S of the kind specially studied in pure
mathematics.* If dynamical determinism be assumed,
but not otherwise, the initial phase will decide whether
or not the event E will occur. Consequently the possible
initial phases may be classified as E-phases or not-E-phases
(let us say $\bar{E}$-phases), so that the whole phase aggregate is
divided into two sub-aggregates. Now the question of
assigning a measure to such aggregates has been deeply
studied in modern pure mathematics, the guiding idea
being that of extending as widely as possible the scope
of a concept familiar in simple cases, namely the cardinal
number of a finite set of objects, the length of a line, the
area of a surface, the volume of a solid. If M is the
measure of the whole aggregate S of possible phases, and
pM the measure of the aggregate of E-phases contained
in it, then p is the probability p(E; S).

* We use the same letter S as before, regarding the system
now as the totality of its possible phases.

Something has been glossed over here; there is the
tacit assumption that the initial phases are “equally
likely.” But let us insist that the question of equal
likeliness is not one for the abstract formulation at all;
for to specify the aggregate is in effect to say that its
elements, the initial phases, are equally likely. For
example, if the aggregate were of points on a continuous
line segment, and the measure were ordinary length, then
we have implied in this description that all points in the
segment are equally likely. On the other hand, the question
of equal likeliness is crucial in the application to experiment
or observation, that is, in applied statistics, where a
wrong choice of the aggregate may alter all the
probabilities. This has long been known in problems of
so-called geometrical probability. For example, given a
circle, let a chord be drawn across it at random: what
is the probability that the length of the chord exceeds
half the diameter? It depends entirely on the manner
in which the chord is drawn. If it is done by taking a
point on the circumference and then drawing the chord
at any angle, all angles being thus supposed equally likely,
then the probability is 2/3; but if it is done by taking
any diameter and drawing the chord at right angles to
any point taken in the diameter, the diameters and points
being equally likely, then the probability is $\sqrt{3}/2$.
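The two values quoted for the chord problem can be checked by simulation. The sketch below is illustrative only: it assumes a circle of unit radius and a pseudo-random generator in place of “drawing at random,” and the function names are introduced here, not taken from the text. In the first method the chord makes an angle with the tangent at the chosen point, so that its length is 2R sin(angle); in the second the chord is at distance d from the centre, so that its length is 2 sqrt(R² - d²).

```python
import math
import random

def chord_by_angle(rng, radius=1.0):
    """Fix a point on the circumference and draw the chord at an angle to the
    tangent there, all angles between 0 and pi being taken as equally likely."""
    angle = rng.uniform(0.0, math.pi)
    return 2.0 * radius * math.sin(angle)

def chord_by_diameter(rng, radius=1.0):
    """Take a point uniformly on a diameter and draw the chord through it
    at right angles to that diameter."""
    d = rng.uniform(-radius, radius)   # signed distance of the point from the centre
    return 2.0 * math.sqrt(max(0.0, radius * radius - d * d))

def estimate(method, trials=200000, seed=0, radius=1.0):
    """Proportion of chords longer than half the diameter, i.e. than the radius."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if method(rng, radius) > radius)
    return hits / trials

print(estimate(chord_by_angle), 2 / 3)
print(estimate(chord_by_diameter), math.sqrt(3) / 2)
```

The two estimates differ because the two procedures specify different aggregates of equally likely phases, which is precisely the point made in the text.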

The inclusion of the words “equally likely” in a definition
is in fact a concession; it puts the reader more gently at
terms with the abstract formulation by anticipating its chief
future application. The usage is not uncommon. When a
point is defined as “that which has position but no magnitude”
the same appeal is made to an application, but the same
suspicion of a circle in definition is incurred, for how can
position be defined without the notion of a point? And if a
straight line is defined as “lying evenly” between its extreme
points, what else does “evenly” mean but “in a straight
line”? Every definition which is not pure abstraction must
appeal somewhere to intuition or experience by using some
such verbal counter as “point,” “straight line” or “equally
likely,” under the stigma of seeming to commit a circle in
definition.

This prologue, though it has omitted many subtler
points which could be amplified at very great length,
must now be cut short. To summarize: (i) events E
are conceived as associated with, or caused by, phases
$S_i$ of circumstances; (ii) each $S_i$ gives rise unambiguously
either to E or to $\bar{E}$; (iii) the phases $S_i$ form in their
totality a set or aggregate S, of which the phases favourable
to E, and those favourable to $\bar{E}$, form complementary
subsets; (iv) a measure M can be given to the whole set
S, and if pM is the measure of the subset favourable to E,
then p is the probability p(E; S) of E with respect to S;
(v) the question of equal likeliness of phases is the same
as the question of specifying the aggregate and its measure,
and in practical applications this must be determined by
the circumstances of the particular problem. Let us
finally add that the word phase can be extended to include
coordinates other than dynamical ones; also that the
name “fundamental probability set” is used by some
writers for the set S of phases $S_i$.

5. Definition of Probability. In an elementary
treatment a rigorous formulation in terms of general 
aggregates is not possible. It will be necessary to restrict 
consideration to aggregates with a finite number of elements 
only ; in this case the measure of an aggregate or sub- 
aggregate is simply the number of elements it contains. 
The reader may take it that the theorems can be extended 
to more general aggregates. 

Definition. If an event E can result from the phases
of a system S, there being n different phases and no more,
all equally likely a priori; and if m of these phases entail
the occurrence of E (so that n - m do not), then m/n is the
probability p(E; S) of E with respect to S.
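In the finite case the definition amounts to counting, and may be sketched as follows. The function name `probability` and the two small systems are illustrative assumptions; exact fractions are used so that m/n is not rounded.

```python
from fractions import Fraction
from itertools import product

def probability(phases, entails_event):
    """Return m/n, where n counts the phases and m those entailing the event."""
    phases = list(phases)
    m = sum(1 for phase in phases if entails_event(phase))
    return Fraction(m, len(phases))

die = range(1, 7)                           # n = 6 phases, presumed equally likely
two_coins = list(product("HT", repeat=2))   # n = 4 phases: HH, HT, TH, TT

p_ace = probability(die, lambda face: face == 1)
p_head_and_tail = probability(two_coins, lambda c: set(c) == {"H", "T"})

print(p_ace, p_head_and_tail)
```

The presumption of equal likeliness is carried by the choice of the list of phases, not by anything in the counting itself.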

Continuous Case. If the event E is described by
the value of a continuous variable x, we may denote the
probability that x is found between $x + \tfrac{1}{2}\Delta x$ and $x - \tfrac{1}{2}\Delta x$ by

$p(x + \tfrac{1}{2}\Delta x,\ x - \tfrac{1}{2}\Delta x;\ S) = \Delta p(x; S)$,   (1)

let us say. By supposing n to tend to infinity and $\Delta x$
to tend to zero we reach the conception of a differential
element of probability, or probability differential,

$p(x + \tfrac{1}{2}dx,\ x - \tfrac{1}{2}dx;\ S) = dp(x; S)$,   (2)

which, when no misunderstanding about S is likely to
arise, we shall often denote briefly by dp.

Complementary Event. The failure of E is denoted
by $\bar{E}$, and is called the complementary event. The
probability of $\bar{E}$ is (n - m)/n, namely 1 - p in the finite case,
and likewise in the continuous case. This is often termed
the complementary probability and denoted by q, so that
p + q = 1.

If n is finite and if E must inevitably happen in all
of the n ways, then p = 1 and E is “certain,” while
q = 0 and $\bar{E}$ is “impossible.” If, however, the system
S depends on a non-finite set or results in events expressible
by a continuous variable, we must not suppose that p = 1
implies certainty, or p = 0 impossibility. For example, if
a point is taken on a line segment, the chance of a particular
point P being taken is 0; but some point is taken, and so
the point P cannot be regarded as impossible.

6. Addition and Multiplication of Probabilities.
Dependent and Independent Events. An event F will
be said to be dependent on an event E when the happening
of either E or $\bar{E}$ alters the probability of F; and in the
contrary case F will be said to be independent of E. An
extreme case of dependence is that in which the happening
of either E or F makes the probability of the other equal
to zero. The events are then said to be mutually exclusive.
(In the continuous case we must take cognizance of “almost
mutually exclusive” and “almost independent” events, just
as we have of “almost impossible” events for which p = 0.)

The addition theorem of probability is applicable to 
events which are mutually or almost mutually exclusive. 

Theorem. When an event E may happen in the
form of any one of r mutually exclusive events $E_j$,
j = 1, 2, 3, ..., r, in a system S which has n equally likely
phases, the probability of $E_j$ being $p_j$, then the probability
of E is

$p(E; S) = p_1 + p_2 + \dots + p_r = \sum_j p_j$.   (1)

Proof. If $n_j$ of the n phases entail $E_j$, then $p_j = n_j/n$.
Since the phases do not overlap (otherwise the events $E_j$
would not be mutually exclusive) the total number of
phases entailing one or other of the $E_j$ is $\sum_j n_j$; and so

$p(E; S) = \sum_j n_j/n = \sum_j p_j$.

The theorem, which is sometimes called the theorem of
Total Probability, continues to hold for systems expressed
by a non-finite n or by a continuous variable.
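The counting argument of the proof can be followed on a small system. As an assumed illustration, take two dice (36 equally likely phases), let E be a total of 8 points, and let $E_j$ be the mutually exclusive event that the first die shows j and the total is 8. The variable names below are chosen for this sketch.

```python
from fractions import Fraction
from itertools import product

phases = list(product(range(1, 7), repeat=2))   # n = 36 equally likely phases
n = len(phases)

# n_j: number of phases entailing E_j (first die shows j, total is 8).
n_j = {j: sum(1 for (a, b) in phases if a == j and a + b == 8)
       for j in range(1, 7)}
p_j = {j: Fraction(count, n) for j, count in n_j.items()}

# Probability of E found directly, and by adding the p_j.
p_direct = Fraction(sum(1 for (a, b) in phases if a + b == 8), n)
p_by_addition = sum(p_j.values())

print(p_direct, p_by_addition)
```

Since no phase is counted under two different $E_j$, the counts $n_j$ add up to the count for E, and the two results agree.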

The multiplication theorem, or theorem of Compound 
Probability , refers in the first instance to independent 
events, but can easily be made applicable, with a suitable 
definition of conditioned probability for dependent events, 
to the latter case. 

Theorem. If $E_j$, j = 1, 2, 3, ..., r, are r independent
events, each with respect to its own system $S_j$, the
probability that they all happen when all the $S_j$ are in
operation is

$p(E; S) = p_1 p_2 \dots p_r$,   (2)

where E denotes the compound event consisting in the
happening of all the $E_j$, S denotes the compound system
consisting in the operation of all the $S_j$, and $p_j = p(E_j; S_j)$.

Proof. Let $n_j$ denote the number of phases of $S_j$,
and of these let $m_j$ entail $E_j$. Now each of the $n_j$ phases
of $S_j$ may be paired in turn with each of the $n_k$ phases of
$S_k$, giving rise to $n_j n_k$ compound phases of the double
system $(S_j, S_k)$. By similar reasoning the $m_j$ phases
entailing $E_j$ may be paired in turn with the $m_k$ phases
entailing $E_k$, giving rise to $m_j m_k$ compound phases of
$(S_j, S_k)$ entailing the double event $(E_j, E_k)$.

By similar reasoning, or step by step, there are
altogether $n_1 n_2 \dots n_r$ phases of the compound system
$(S_1, S_2, \dots, S_r) = S$, and of these $m_1 m_2 \dots m_r$ entail the
compound event $(E_1, E_2, \dots, E_r) = E$. Hence the
probability of E with respect to S is

$p(E; S) = m_1 m_2 \dots m_r / n_1 n_2 \dots n_r = p_1 p_2 \dots p_r$.

Once again we must content ourselves with the
statement that the theorem remains true for independent or
“almost independent” systems involving infinite aggregates
or continuous variables.
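The pairing of phases used in the proof of the multiplication theorem can be carried out mechanically for finite systems. The sketch below forms every compound phase with `itertools.product` and compares the direct count with the product of the separate probabilities; the function `compound_probability` and the three example systems are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def compound_probability(systems):
    """systems: list of (phases, entails) pairs, one pair per system S_j.
    Returns (probability by direct count of compound phases,
             product of the separate probabilities p_j)."""
    all_phases = [list(phases) for phases, _ in systems]
    tests = [entails for _, entails in systems]

    compound = list(product(*all_phases))          # n_1 n_2 ... n_r compound phases
    m = sum(1 for c in compound
            if all(test(x) for test, x in zip(tests, c)))
    direct = Fraction(m, len(compound))

    by_theorem = Fraction(1)
    for phases, test in zip(all_phases, tests):
        by_theorem *= Fraction(sum(1 for x in phases if test(x)), len(phases))
    return direct, by_theorem

systems = [
    (range(1, 7), lambda face: face == 1),   # an ace with a die
    (range(1, 7), lambda face: face == 1),   # an ace with a second throw
    ("HT", lambda side: side == "H"),        # a head with a coin
]
direct, by_theorem = compound_probability(systems)
print(direct, by_theorem)
```

Forming the full Cartesian product of phases is what encodes independence here: every phase of one system is paired with every phase of each of the others.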

By modifying the definition of $p_2, p_3, \dots, p_r$ we
may prove an analogous theorem for a chain of events
$E_1, E_2, \dots, E_r$, each of which influences the probability
of its successors.

Let $p_2 = p(E_2; E_1, S_2)$ denote the probability of $E_2$
after $E_1$ has happened, $p_3 = p(E_3; E_1, E_2, S_3)$ denote the
probability of $E_3$ after $E_1$ and $E_2$ have happened, and so
on. Slight consideration will show that this simply
involves putting the events in an order of time and that
then, with the new interpretation of $p_2, p_3, \dots, p_r$, the
above proof proceeds exactly as before. Hence we have
the theorem of compound probability for a chain of
conditioned events:

$p(E; S) = p(E_1)\, p(E_2; E_1)\, p(E_3; E_1, E_2) \dots p(E_r; E_1, E_2, \dots, E_{r-1})$.   (3)
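A chain of conditioned events arises naturally in drawing balls from a bag, where each draw may change the composition of the bag for the next. The sketch below multiplies the conditioned probabilities in order of time for a bag of 4 black and 3 white balls; the function name and its argument are assumptions made for this illustration.

```python
from fractions import Fraction

def prob_three_black(replace_after_draw, black=4, white=3):
    """Probability that the successive draws are all black.
    replace_after_draw[k] states whether the k-th ball drawn is put back."""
    p = Fraction(1)
    for replaced in replace_after_draw:
        # Conditioned probability of black, given the earlier draws were black.
        p *= Fraction(black, black + white)
        if not replaced:
            black -= 1            # a black ball has left the bag
    return p

print(prob_three_black((True, True, True)))      # each ball replaced
print(prob_three_black((True, False, False)))    # only the first replaced
print(prob_three_black((False, False, False)))   # no ball replaced
```

When every ball is replaced the factors are equal and the chain reduces to the product for independent events.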

These theorems of addition and multiplication of 
probabilities are the fundamentals upon which the
mathematical theory of statistics is raised. Since addition and
multiplication are operations of ordinary algebra, we may
anticipate that there is an algebra of probability
depending on these operations, according to which expressions
representing independent systems Sj can be compounded 
in product and the resulting probabilities found by 
inspection of terms. This algebra is the algebra of 
generating functions of probability, which we shall consider 
from an elementary standpoint in the next section. 

Ex. 1. The probability of throwing two consecutive aces
with a true die is 1/6 · 1/6 = 1/36.

Ex. 2. The probability of throwing a head and a tail with
two coins is 1/2.

Ex. 3. The probability of throwing a total of 8 points
with two dice is 5/36. (The mutually exclusive events are
6+2, 5+3, 4+4, 3+5, 2+6.)

Ex. 4. A bag contains 4 black and 3 white balls. Show
that the probability of drawing 3 black in succession is 64/343
if the ball drawn is replaced each time, 8/49 if the first ball
drawn is replaced but not the others, 4/35 if no ball is replaced.

Ex. 5. The events $E_1$ and $E_2$ are neither independent nor
mutually exclusive. Denote by $p_{12}$ the probability that $E_1$
and $E_2$ both happen. Prove that the probability that at
least one of $E_1$ and $E_2$ happens is $p_1 + p_2 - p_{12}$.

Ex. 6. Generalize the preceding theorem to r events
$E_1, E_2, \dots, E_r$. Prove, with an analogous notation, that the
probability that at least one of the events happens is

$\sum p_j - \sum\sum p_{jk} + \sum\sum\sum p_{jkl} - \dots + (-1)^{r-1} p_{12 \dots r}$.
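The result for events which are neither independent nor mutually exclusive can be checked by enumeration before a proof is attempted. In the sketch below the system is two dice, and the three overlapping events are hypothetical, chosen only to exercise the alternating sum; `p_joint` is a name introduced here.

```python
from fractions import Fraction
from itertools import combinations, product

phases = list(product(range(1, 7), repeat=2))   # two dice, 36 equally likely phases

# Hypothetical events, chosen so that they overlap and are dependent.
events = [
    lambda a, b: a == 6,          # the first die shows six
    lambda a, b: a + b >= 10,     # the total is at least ten
    lambda a, b: a == b,          # a doublet is thrown
]

def p_joint(indices):
    """Probability that all the events with the given indices happen together."""
    m = sum(1 for ph in phases if all(events[i](*ph) for i in indices))
    return Fraction(m, len(phases))

r = len(events)
# Alternating sum over single events, pairs, triples, and so on.
by_formula = sum(
    (-1) ** (k - 1) * sum(p_joint(c) for c in combinations(range(r), k))
    for k in range(1, r + 1)
)
# Direct count of the phases in which at least one event happens.
direct = Fraction(sum(1 for ph in phases if any(e(*ph) for e in events)),
                  len(phases))
print(by_formula, direct)
```

Agreement on one example is of course no proof, but it shows how each phase counted more than once in the first sum is corrected by the later ones.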

7. Generating Functions of Probability. We shall
often denote the probability that a variable x takes a
particular value by $\phi(x)$, and we shall use the following
nomenclature:

Probability Function. The function $\phi(x)$ is the
probability function. When the set of values of x is
continuous we shall write the probability differential
$dp = \phi(x)\,dx$ for the probability that x is found in the
range $(x - \tfrac{1}{2}dx,\ x + \tfrac{1}{2}dx)$. In this case $\phi(x)$ is often called
the probability density.

Variate. A variable which has a probability function
will be called a variate.

Generating Function. Associated with $\phi(x)$ we
introduce the generating function (g.f.) of probability,
defined by

$G(t) \equiv G(t; \phi) = \sum_j \phi(x_j)\, t^{x_j}$   (1)

for variables which take discrete values, and by

$G(t) = \int \phi(x)\, t^x\, dx$   (2)

for continuous variables, the integral being over the whole
range of possible values of x.

Ex. 1. The generating function of probability for heads 
in a symmetrical coin is 

Ex. 2. The g.f. for a symmetrical six-sided die is 
+*•+<•) = j£(l —£ 6 )/(l — t). 

Ex. 3. The g.f. for an unsymmetrical coin in which the 
probability of heads is p, of tails q , is pt+q. 

Ex. 4. Write down the g.f. for a symmetrical four-sided
die ; also for an unsymmetrical one in which the probabilities
of faces marked 1, 2, 3, 4 are $p_1, p_2, p_3, p_4$.

Ex. 5. If all points on the straight line from x = 0 to
x = 1 are equally probable, the g.f. is

    $\int_0^1 t^x\,dx = (t - 1)/\log_e t$.

8. Properties of Generating Functions. Suppose
first that we have an event $E_1$ and its complement $\bar{E}_1$,
of respective probabilities $p_1$ and $q_1$, and a second independent
event $E_2$ with its complement $\bar{E}_2$, of probabilities
$p_2$ and $q_2$. Then the compound probabilities of the four
mutually exclusive events

    $(E_1, E_2),\ (E_1, \bar{E}_2),\ (\bar{E}_1, E_2),\ (\bar{E}_1, \bar{E}_2)$    (1)

are respectively $p_1p_2$, $p_1q_2$, $q_1p_2$, $q_1q_2$. Let us relate these
to the terms on the right of the algebraic identity

    $(p_1t_1 + q_1)(p_2t_2 + q_2) = p_1p_2\,t_1t_2 + p_1q_2\,t_1 + q_1p_2\,t_2 + q_1q_2$.    (2)

Study of this identity will reveal the most important 
property of generating functions. The disjunction between 
added terms (those linked by plus signs), both in the factors 
on the left and in the expanded product on the right, 
reflects in each case the disjunction into a number of
mutually exclusive events. The operations of multiplication,
on the other hand, are carried out on expressions symbolizing
independent events. For example, the multiplication
of the two factors on the left interprets the compounding
of the two independent systems $S_1$ and $S_2$ of which they
are the generating functions ; and the results of multiplication
visible in single terms on the right, such as
$p_1p_2\,t_1t_2$, represent at the same time the compounded
probabilities, $p_1p_2$, and the compounded events, $t_1t_2$,
characterizing $(E_1, E_2)$. In fact the algebraic operations
are faithfully carrying out the consequences of the two
basic theorems of probability. Mere inspection will
convince us that this is true not only for binomial
expressions compounded in product as above, but for
multinomial expressions, as in the following example.

Ex. 1. Let the reader consider events $E_1, E_2, E_3$ of probabilities
$p_1, p_2, p_3$ with respect to S, events $E'_1, E'_2, E'_3, E'_4$
with probabilities $p'_1, p'_2, p'_3, p'_4$ with respect to an independent
system S', and examine the product

    $(p_1t_1 + p_2t_2 + p_3t_3)(p'_1t'_1 + p'_2t'_2 + p'_3t'_3 + p'_4t'_4)$

in relation to the 12 events of the compound system
T = (S, S').

Regarding a compound system ( S , S') as a single 
system and introducing further independent systems one 
at a time, we may prove step by step that to find the 
respective probabilities of all the mutually exclusive events 
arising from the compounding of r independent systems, 
we must construct the product of r expressions of the
kind exemplified above, and examine the individual terms 
of the expansion. 

Ex. 2. In an expansion of three factors such a term as
$p_j\,p'_k\,p''_l\; t_j\,t'_k\,t''_l$ would be interpreted as meaning that the
compound event $(E_j, E'_k, E''_l)$ has probability $p_j\,p'_k\,p''_l$.

The variables $t_j$ and so on are introduced for the sole
purpose of preventing the terms from being merged
together ; for when the $p_j$ are explicit numerical fractions
some such device is needed.

Now suppose the event $E_j$ involves the addition of $x_j$
points to a score, or the assumption by an additive variate
x of an increment $x_j$. In such a case we represent $E_j$ by $t^{x_j}$
rather than by $t_j$, taking advantage of the fact that when
expressions like $p_j t^{x_j}$ and $p_k t^{x_k}$ are multiplied together we
have by the law of indices $p_j p_k t^{x_j + x_k}$, the probabilities
being multiplied as they ought to be, and the increments
$x_j$ and $x_k$ being added as they ought to be. With this
understanding, the system under which x may assume
values $x_j$ with probabilities $p_j$, j = 1, 2, ..., r, is characterized
by the expression

    $\sum_j p_j\, t^{x_j}$.    (3)

But this is merely the generating function G(t) of the 
system, and so we infer the important theorem, for 
discrete variates in finite sets : 

The g.f. of a compound of independent systems is the 
product of the g.f.’s of the separate systems. 

By a limiting process, with due precautions on the
functions concerned, this multiplicative law can be extended
to g.f.’s involving continuous variables. Thus, if $G_1(t)$ is
the g.f. of the variable x, and $G_2(t)$ of a statistically
independent variable y, then $G_1(t)G_2(t)$ is the g.f. of x+y ;
and so for more than two variables.
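
The multiplicative law for discrete variates may be tried out numerically. In the Python sketch below a g.f. is stored as a dictionary mapping each exponent x to its probability, so that multiplying two g.f.'s adds exponents and multiplies probabilities. The representation and the names gf_product and symmetrical_die are assumptions made for illustration only.

```python
from fractions import Fraction

def gf_product(g1, g2):
    """Multiply two g.f.'s stored as {exponent: probability} dictionaries."""
    out = {}
    for x1, p1 in g1.items():
        for x2, p2 in g2.items():
            # law of indices: exponents add, probabilities multiply
            out[x1 + x2] = out.get(x1 + x2, Fraction(0)) + p1 * p2
    return out

def symmetrical_die(faces):
    """g.f. of a symmetrical die with faces numbered 1, 2, ..., faces."""
    return {x: Fraction(1, faces) for x in range(1, faces + 1)}

coin = {1: Fraction(1, 2), 0: Fraction(1, 2)}      # (t + 1)/2
three_coins = gf_product(gf_product(coin, coin), coin)

two_dice = gf_product(symmetrical_die(6), symmetrical_die(6))
three_dice = gf_product(
    gf_product(symmetrical_die(4), symmetrical_die(6)), symmetrical_die(8)
)
print(three_coins)
print(two_dice[8])
```

Here three_coins holds the coefficients of the cube of the coin's g.f., two_dice[8] is the probability of a total of 8 points with two dice, and three_dice compounds a tetrahedral, a cubical and an octahedral die.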

Ex. 3. The probabilities of 3 heads, 2 heads, 1 head and
no heads in a throw of three symmetrical coins (or three
separate throws of one coin) are the coefficients of $t^3$, $t^2$, $t$ and 1
in the expansion of $(\tfrac{1}{2}t + \tfrac{1}{2})^3$, namely
$\tfrac{1}{8}, \tfrac{3}{8}, \tfrac{3}{8}, \tfrac{1}{8}$ respectively.

Verify this also by enumeration of cases. (Write H for head,
T for tail ; then the cases are HHH ; HHT, HTH, THH ;
HTT, THT, TTH ; TTT.)

Ex. 4. The corresponding probabilities when the coin is
unsymmetrical, with probability p for heads and q for tails,
are the coefficients in the expansion of $(pt+q)^3$.

Ex. 5. The probabilities when the three coins are different
and unsymmetrical are the coefficients in the expansion of
$(p_1t + q_1)(p_2t + q_2)(p_3t + q_3)$.

Ex. 6. The probabilities of n, n-1, ..., 2, 1, 0 heads in
n throws of an unsymmetrical coin are the coefficients of
powers of t in the expansion of $(pt+q)^n$.

Ex. 7. Write down the corresponding g.f. for the 
simultaneous throw of n different unsymmetrical coins. 

Ex. 8. A tetrahedral, a cubical and an octahedral die, all
symmetrical, are thrown together, their faces being numbered
in each case from 1 upwards. Show that the probabilities of
totals 3, 4, ..., 18 are arrayed by coefficients in the expansion
of

    $\tfrac{1}{4}\cdot\tfrac{1}{6}\cdot\tfrac{1}{8}\; t^3 (1-t^4)(1-t^6)(1-t^8)/(1-t)^3$.

Ex. 9. A coin is thrown n times. Each time a head
occurs, 2 is added to the score ; each time a tail occurs, 1 is
subtracted. The g.f. is

    $(\tfrac{1}{2}t^2 + \tfrac{1}{2}t^{-1})^n = 2^{-n}\, t^{-n}\, (t^3 + 1)^n$.

Ex. 10. Four tickets marked 00, 01, 10, 11 respectively
are placed in a bag, and drawn one at a time, being replaced
each time. Prove that the chance of drawing five times and
obtaining ticket numbers summing to 23 is the coefficient of
$t^2u^3$ in the expansion of $4^{-5}(1+t+u+tu)^5 = 4^{-5}(1+t)^5(1+u)^5$.

Find this coefficient, and verify the result by enumeration. 
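
The enumeration asked for in Ex. 10 is small enough to be carried out by machine. The Python sketch below reads the tickets as the numbers 0, 1, 10, 11, lists all $4^5$ ordered sets of five drawings, and compares the count with the coefficient obtained from the factorized g.f. The variable names are assumptions of the sketch.

```python
from fractions import Fraction
from itertools import product
from math import comb

tickets = [0, 1, 10, 11]   # the tickets 00, 01, 10, 11 read as numbers

# Enumerate every ordered set of five drawings with replacement.
favourable = sum(1 for draw in product(tickets, repeat=5) if sum(draw) == 23)
by_enumeration = Fraction(favourable, 4 ** 5)

# Coefficient of t^2 u^3 in 4^-5 (1+t)^5 (1+u)^5.
by_coefficient = Fraction(comb(5, 2) * comb(5, 3), 4 ** 5)

print(favourable, by_enumeration, by_coefficient)
```

A total of 23 requires the tens digits to sum to 2 and the units digits to sum to 3, which is why the exponent of t counts tens and the exponent of u counts units.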

9. Moments and Moment Generating Functions. 

It is convenient to describe a probability function $\phi(x)$
by certain coefficients or parameters connected with it,
such as moments, cumulants and others later to be defined.
The moments commonly employed are based on powers of x,
and are defined by

    $\mu'_r = \sum x^r \phi(x)$ or $\int x^r \phi(x)\,dx$,    (1)

according as the variate is discrete or continuous. The
summation or integration is over the whole range of
possible values of x. If the values which x can take are
discrete and spaced at unit intervals (for example if x
records the number of heads in n throws of a coin) it is
mathematically preferable to use factorial moments, defined
by

    $\mu'_{(r)} = \sum x^{(r)} \phi(x)$, where $x^{(r)} = x(x-1)(x-2)\ldots(x-r+1)$.    (2)

Note . The privilege often accorded to ordinary “ power ” 
moments is one of custom only ; no special sanctity attaches 
to them. 

Mathematical Expectation. If f(x) is a function of
x, and $\phi(x)$ is the probability function, or $\phi(x)\,dx$ the
probability differential, then the sum or integral

    $\sum f(x)\phi(x)$ or $\int f(x)\phi(x)\,dx$    (3)

is called the mathematical expectation of f(x). It is often
denoted by E f(x). The rth moment is therefore the
mathematical expectation of $x^r$.

Moment Generating Functions. If we put $t = e^{\alpha}$
in the g.f. of probability G(t), we obtain

    $G(e^{\alpha}) = \sum \phi(x) e^{\alpha x}$ or $\int \phi(x) e^{\alpha x}\,dx$    (4)
    $\quad = 1 + \mu'_1\alpha + \mu'_2\alpha^2/2! + \mu'_3\alpha^3/3! + \ldots$,

provided that the sum or integral converges over a range
of $\alpha$ and that expansion of $e^{\alpha x}$ and integration term by
term is permissible. This function, which we shall denote
by $M(\alpha)$, may be regarded as generating the moments $\mu'_r$,
in the sense that $\mu'_r$ is the coefficient of $\alpha^r/r!$ in $M(\alpha)$.
Of course $\alpha$, like t, is a variable introduced to facilitate
manipulation, in fact to carry the moments. We shall
call $M(\alpha)$ the moment generating function (m.g.f.) of x or
of $\phi(x)$, or of the system under consideration.

Factorial Moment Generating Functions. When
factorial moments are in question, we can construct a
factorial moment generating function (f.m.g.f.) very simply
from the probability g.f. by the substitution $t = 1+\alpha$.
For then we have

    $G(1+\alpha) = \sum \phi(x)(1+\alpha)^x$    (5)
    $\quad = 1 + \mu'_{(1)}\alpha + \mu'_{(2)}\alpha^2/2! + \mu'_{(3)}\alpha^3/3! + \ldots$

by expanding $(1+\alpha)^x$ by the binomial theorem and summing
the resulting terms.


Example. The f.m.g.f. of the distribution characterized by
$(pt+q)^n$ is $(1+p\alpha)^n$.
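
As a check on this example, the factorial moments of the distribution with g.f. $(pt+q)^n$ may be computed directly from definition (2) and compared with the coefficients of $\alpha^r/r!$ in $(1+p\alpha)^n$, which are $n^{(r)}p^r$. The Python sketch below does this for one arbitrary choice of n and p; the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_pf(n, p):
    """Probability function of the number of heads, read off (pt+q)^n."""
    q = 1 - p
    return {x: comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)}

def falling(x, r):
    """x^(r) = x(x-1)...(x-r+1), with x^(0) = 1."""
    out = 1
    for i in range(r):
        out *= x - i
    return out

def factorial_moment(phi, r):
    """Sum of x^(r) phi(x) over all values of x."""
    return sum(falling(x, r) * px for x, px in phi.items())

n, p = 5, Fraction(1, 3)
phi = binomial_pf(n, p)

direct = [factorial_moment(phi, r) for r in range(4)]
from_fmgf = [falling(n, r) * p ** r for r in range(4)]
print(direct)
print(from_fmgf)
```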


Note . The reader who is acquainted with more advanced
mathematics may observe that for moment generating
functions the substitution $t = e^{iu}$ instead of $t = e^{\alpha}$ has a
certain advantage. It gives the modified m.g.f.

    $\int e^{iux} \phi(x)\,dx$,    (6)

a Fourier transform of $\phi(x)$. The integrand and integral are
bounded, and the reciprocal theorems of Fourier transforms
are available.

10. Cumulants and Cumulant Generating Functions.
If the logarithm of the moment generating
function $M(\alpha)$ can be expanded as a convergent series in
powers of $\alpha$ in the form

    $K(\alpha) = \log_e M(\alpha) = \kappa_1\alpha + \kappa_2\alpha^2/2! + \kappa_3\alpha^3/3! + \ldots$,    (7)

then $K(\alpha)$ is defined to be the cumulant g.f., and the
coefficients $\kappa_r$ are called the cumulants * of the function
$\phi(x)$. Since m.g.f.’s are compounded in product, c.g.f.’s
must be compounded in sum, whence the theorem :

When independent systems are compounded the rth cumulants
$\kappa_r$ of the separate systems are added to form the rth
cumulant of the compound system.

This additive property of cumulants is indeed the
reason for introducing them. In the same way, by taking
the logarithm of the f.m.g.f. we can define a factorial
c.g.f. and factorial cumulants.

* The word “ cumulant,” suggested by R. A. Fisher, is to be
preferred to the older term “ seminvariant,” since “ seminvariant ”
is already appropriated in the theory of algebraic invariants.

Example. The factorial cumulants corresponding to
$(pt+q)^n$ are $np$, $-np^2$, $2!\,np^3$, $-3!\,np^4$ and so on.
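
The additive property of cumulants can be observed on a small numerical case. The Python sketch below compounds an unsymmetrical coin with a symmetrical die and compares the first three cumulants of the compound system with the sums of the separate ones. It assumes the standard identification of the first three cumulants with the mean and the second and third moments about the mean; the function names are hypothetical.

```python
from fractions import Fraction

def compound(phi1, phi2):
    """Probability function of x + y for independent variates x and y."""
    out = {}
    for x, px in phi1.items():
        for y, py in phi2.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

def cumulants(phi):
    """First three cumulants: the mean, then the 2nd and 3rd moments about it."""
    k1 = sum(x * p for x, p in phi.items())
    k2 = sum((x - k1) ** 2 * p for x, p in phi.items())
    k3 = sum((x - k1) ** 3 * p for x, p in phi.items())
    return k1, k2, k3

coin = {1: Fraction(1, 3), 0: Fraction(2, 3)}      # unsymmetrical coin
die = {x: Fraction(1, 6) for x in range(1, 7)}     # symmetrical die

separate = [a + b for a, b in zip(cumulants(coin), cumulants(die))]
together = list(cumulants(compound(coin, die)))
print(separate)
print(together)
```

Moments about the origin of order two and higher do not add in this simple way, which is the practical advantage of working with cumulants.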

11. Change of Origin and Scale in Generating
Functions. Change of Origin. If the origin from which
the variate x is measured is transferred from x = 0 to
x = a, any value x will be changed to x - a. Hence
every factor $t^x$ in a term of the probability g.f. will become
$t^{x-a}$ ; but the accompanying probability $\phi(x)$, though
changed in notation, will not be changed in value. Hence
the effect is to multiply the whole g.f. by $t^{-a}$.

This very simple rule leads to corresponding ones for
the m.g.f., f.m.g.f. and c.g.f., namely :

A change of origin from x = 0 to x = a has the effect
of multiplying the m.g.f. by $e^{-a\alpha}$ ; of multiplying the f.m.g.f.
by $(1+\alpha)^{-a}$ ; and of adding to the c.g.f. the term $-a\alpha$.

Thus only the first cumulant $\kappa_1$ is changed ; it becomes
$\kappa_1 - a$, while $\kappa_2, \kappa_3, \ldots$ are unaltered.

Change of Scale. If the scale of measurement is
altered so that what was previously recorded as x now
reads kx, then every factor $t^x$ in the previous g.f. now
becomes $t^{kx}$, that is, $(t^k)^x$. Hence in the m.g.f. the previous
$e^{\alpha x}$ now reads $e^{\alpha kx}$. Hence the rules :

Change of scale, so that x becomes kx, has the effect of
replacing t by $t^k$ in the probability g.f., $\alpha$ by $k\alpha$ in the m.g.f.

The immediate consequence is that the previous rth
moment $\mu'_r$ and rth cumulant $\kappa_r$ become $k^r\mu'_r$ and $k^r\kappa_r$.

The reason for the older name seminvariant is now seen ;
for under a change of origin and scale in x all cumulants after
$\kappa_1$ are altered at most by a scale factor.

Change of scale in the f.m.g.f. will be effected by
replacing $1+\alpha$ by $(1+\alpha)^k$.


Example. The first moment or mean of the distribution
which has g.f. $(pt+q)^n$ is np, and the m.g.f. with respect to
the mean as origin is $e^{-np\alpha}(pe^{\alpha}+q)^n$. The corresponding
f.m.g.f. is $(1+\alpha)^{-np}(1+p\alpha)^n$.

12. Population, Universe or Stock,
Sample. To conclude these questions of nomenclature
and general notions we explain what is meant by population,
universe or stock, and sample. As an example let
us consider the repetition of an experiment in which the
probability of success is p = m/n, a rational fraction. We
may construct a model by taking (or imagining) n similar 
objects, such as equal spherical marbles, of which m are 
distinguishable from the rest, and drawing an object 
repeatedly, with replacement after each drawing. Such 
an assemblage, actual or hypothetical, constitutes a 
population, universe or stock. It is in fact merely a model 
of the system S. To cope with special cases we have 
often to conceive a fictitious infinite population. For 
example, if we wish to represent drawing with replacement 
by a model in which the drawing is without replacement, 
the population of the model will certainly have to be 
infinite, since the probabilities of successive drawings are 
constant, a thing which cannot happen with a finite 
population. 

Sample. Any element of a population is a sample
of that population. For example, if five drawings are
made, with replacement each time, from six cards numbered
1, 2, 3, 4, 5, 6, the population of possible sets of five cards
contains $6^5$ or 7776 elements, of which (3, 5, 5, 4, 1) and
(4, 4, 2, 6, 3) are two samples. If the drawing is without
replacement, the population of sets of five contains
6.5.4.3.2 or 720 elements, of which (2, 3, 5, 6, 4) and
(5, 2, 4, 3, 1) are two samples. Or again, if a coin is spun
100 times, the sequence of heads and tails arising is to be
regarded as one sample out of the possible $2^{100}$ sequences
constituting the population of sequences.
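
The two populations of sets of five cards are small enough to be listed in full. The following Python sketch, given purely as an illustration, builds both with the standard itertools functions and confirms that the samples quoted above belong to them; the variable names are assumptions.

```python
from itertools import permutations, product

cards = range(1, 7)

# Ordered sets of five drawings with replacement after each drawing.
with_replacement = set(product(cards, repeat=5))

# Ordered sets of five drawings without replacement.
without_replacement = set(permutations(cards, 5))

print(len(with_replacement), len(without_replacement))
print((3, 5, 5, 4, 1) in with_replacement)
print((2, 3, 5, 6, 4) in without_replacement)
```

The population of $2^{100}$ sequences of heads and tails is, by contrast, far too large to list, and can only be conceived.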

The word “ sample ” is also used as a verb, “ to sample ”
a population meaning to draw a sample, or samples, from
that population.

Notation. It is important to distinguish the probability
$\phi(x)$, which may not be definitely known, from
the relative frequency of x as found in a sample, let us
say f(x) ; and in the same way all parameters, such as
means and moments, connected with $\phi(x)$ should be
distinguished from the corresponding parameters in the
case of f(x). As far as possible we shall make this distinction
by using Greek letters for probability functions and
parameters, italic letters for the corresponding frequency
functions and parameters. Thus if $\mu'_r$ stands for the rth
moment of $\phi(x)$, then $m'_r$ will be the rth moment of f(x) ;
and so on.

For detailed description of many aspects of theoretical 
and practical statistics, and for bibliographical references to 
memoirs and texts on the subject, the reader may consult 
An Introduction to the Theory of Statistics , by G. U. Yule 
and M. G. Kendall, London, 1937, the 11th edition of the 
original book by the first-named author. 

For an account of moments, factorial moments and 
cumulants, Chapter 3 of M. G. Kendall’s The Advanced 
Theory of Statistics , London, 1943, may be consulted. 

In this book, to cover the topics within a limited 
space, we have made a systematic use of moment generating 
functions. In strictness a mathematical preamble would be 
required, setting out the conditions under which such integral 
transforms exist, and the conditions under which they may 
be uniquely reciprocated to the probability functions. The 
student intending to read advanced statistics will be well 
advised to gain as much preliminary knowledge as possible 
concerning Laplace and Fourier transforms and their inversion.

CHAPTER II

PROBABILITY AND FREQUENCY DISTRIBUTIONS :
GRAPHICAL REPRESENTATION : CALCULATION
OF MOMENTS

13. Distributions, Probability Curve, Histogram.
The assemblage of values of probabilities $\phi(x_j)$ for all the
possible values $x_j$ of x that may occur in any system S,
is called the probability distribution of x in S. In practice
a set of n observations in a sample does not usually give
all the possible values $x_j$, and certainly cannot give them
all if they cover a continuous range. Further, the sample
of n values is itself only one member of the population,
often prodigiously large or even infinite, of possible samples
of n values that might have been drawn.

The relative frequency of $x_j$ in a sample of n values is
denoted by $f(x_j)$. The assemblage of relative frequencies
$f(x_j)$ for the sample is then called the frequency distribution
of x in that sample. The name is also often given to the 
assemblage of absolute or actual frequencies, but these are 
merely obtained by multiplying all relative frequencies 
by n. 

Ex. 1. In repeated throws of a symmetrical coin the
respective probabilities of runs of 1 head, 2 heads, 3 heads,
... are $\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots$ . Hence in 400 throws the ideal
probability distribution may be tabulated (to the nearest
integer) as :

    x          1    2    3    4    5    6    7    8    Total
    $n\phi$   50   25   13    6    3    2    1    0     100

In an actual experiment of 400 throws (performed by the
author) there were 196 heads, and the frequency distribution
of runs of x heads was :

    x          1    2    3    4    5    6    7    8    Total
    nf        51   24   14    4    5    1    0    1     100

Comparing the actual with the theoretical distribution,
the reader will note a fairly close agreement, and also a slight
irregularity in the frequencies.

If x is a continuous variate, the curve $y = \phi(x)$ is
called the probability curve of x. (The term “ frequency
curve ” will often be found, but it is not strictly accurate.
Cf. 12.) The curve may be symmetrical about its central
ordinate ; or it may have the “ long tail ” to the positive 
or right side, in which case it is said to be positively skew ; 
or to the negative or left side, in which case it is negatively 
skew. In some cases, as in the probabilities of runs of 
heads just considered, the curve may not descend at all 
on one side or the other. A curve so extremely skew is 
called positively J-shaped, or negatively J-shaped, as the 
case may be. In a rare type of distribution called the 
U-shaped curve the minimum ordinate is in the middle 
region. The area under a probability curve measures the
total probability of all possible values of x, and is therefore
equal to 1.

[Figure: sketches of probability curves. Panels: Neg. J-shaped ; Neg. skew ; Symmetrical ; Pos. skew ; Pos. J-shaped ; U-shaped.]


If x is a discontinuous variate the plotted points (x, y),
where $y = \phi(x)$, do not form a curve. The sum of the
ordinates is equal to 1. It is customary, though there
is no very cogent reason for doing so, to join these points
each to its neighbour by straight lines, thus obtaining the
probability polygon for the distribution in question. The
terms “symmetry” and “skewness” then have corresponding
meanings.

Frequency Polygon, Histogram. In an actual 
sample of observations we have relative frequencies instead 
of probabilities. If the variate x is discontinuous, as for 
example the number of flowers on stalks, the number of 
beans in bean-pods, we obtain separate plotted points 
(x, f(x)) which, joined to their neighbours, form a frequency
polygon.

[Figure: Frequency Polygon ; Histogram.]

On the other hand, x may be a continuous variate, the 
range of which in the process of measurement is broken 
for convenience into intervals of finite breadth. For 
example, height of men, measured in inches, is a continuous 
variate ; all heights within a certain range are conceivable. 
But in practice heights may be recorded to the nearest
inch, in which case all individuals of the sample having
heights in the range 66.5000... to 67.4999... inches form
a frequency group or frequency class corresponding to
x = 67, the central point of the class. In such a case
it is customary to represent the class graphically not by
a single ordinate at the central point but by a rectangle
on the class-interval (as 66.5 to 67.5) as base and of area
proportional to the class frequency or relative frequency
f(x). The figure of juxtaposed rectangles is then called
the frequency histogram or simply the histogram (that is,
diagram made up of cells), and it furnishes a rough
approximation to the ideal probability curve.

Ex. 2. Plot the probability polygon for the runs of heads 
in Ex. 1 ; also the frequency polygon of the experiment. 

Ex. 3. Note that often great care must be taken to
ascertain the exact class-boundaries and centres of classes.
For example, the British Anthropometric Committee (Report,
1883, p. 256) measured the height of 8585 adult males in the
British Isles, made up of samples of 6194 from England,
1304 from Scotland, 741 from Wales and 346 from Ireland.
The distribution of the Irish sample reads as follows :

    x    59  60  61  62  63  64  65  66  67  68  69  70  71  72  73     n
    nf    1   0   2   2   7  15  33  58  73  62  40  25  15  10   3   346

When we are told, however, that the class x = 59 inches
means “ 59 and over,” but at the same time that measurements
were to the nearest eighth of an inch, it appears that class
x = 59 means from $x = 58\tfrac{15}{16}$ to $59\tfrac{15}{16}$, so that the centre of
the class is at $x = 59\tfrac{7}{16}$ ; and so for every other class.

The reader should draw the histogram for the above 
distribution, choosing not too small a scale for the 
frequency. 

For ease and rapidity in computation we can always 
by a change of origin take any convenient value of x as 
new origin, and by a change of scale make class intervals 
of unit breadth. At the end of any calculations we can 
translate the results back to the proper origin and scale. 
It is often convenient to choose a provisional origin either 
near the middle values of x or at one or other end of the 
range. 

Ex. 4. In the distribution of Ex. 3, if 67 is taken as new
origin for x, the classes range from x = -8 to x = +6. If
these classes are presumed to be centred, the origin is not
67 but $67\tfrac{7}{16}$.
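
To illustrate the use of a provisional origin, the Python sketch below computes the arithmetic mean of the Irish sample of Ex. 3 in two ways: directly from the nominal class values, and from deviations measured from the origin 67, translated back at the end. The final correction of 7/16 inch assumes that each class centre lies 7/16 inch above its nominal value, which is the reading of Ex. 3 adopted in this sketch; the variable names are likewise assumptions.

```python
from fractions import Fraction

heights = range(59, 74)   # nominal class values in inches
counts = [1, 0, 2, 2, 7, 15, 33, 58, 73, 62, 40, 25, 15, 10, 3]
n = sum(counts)

# Direct calculation from the nominal values.
direct = Fraction(sum(x * c for x, c in zip(heights, counts)), n)

# Calculation from the provisional origin 67: small numbers only.
origin = 67
shifted_sum = sum((x - origin) * c for x, c in zip(heights, counts))
mean_nominal = origin + Fraction(shifted_sum, n)

# Translate to the assumed true class centres, 7/16 inch higher.
mean_centred = mean_nominal + Fraction(7, 16)

print(n, shifted_sum)
print(float(mean_nominal), float(mean_centred))
```

The deviations from the provisional origin run only from -8 to +6, so the sum of products is far easier to form by hand than the direct sum.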

14. Descriptive Parameters of Distribution. Probability
and frequency distributions may be described,
not completely, but in their main features, by the values 
of their moments, factorial moments or other parameters. 
Some of these parameters have a geometrical significance. 

Typical Parameters or Averages. There are three 
of these in common use, the mode, the median and the
arithmetic mean.

Mode. The mode is the value of x for which the
probability $\phi(x)$, or in a frequency distribution the relative
frequency f(x), is a maximum, that is, greater than the
probability (or frequency) on either side. In a probability
curve it is the abscissa of a maximal ordinate.

Many curves have a single maximum near the middle; 
others may show two maxima or more. These are called 
dimodal or multimodal , as the case may be. 

[Figure: Mode ; Dimodal Curve ; Median.]

Median. The median is that value of x which divides 
the sum or integral of the probabilities over the whole 
range into two equal parts. This sum or integral must 
be equal to 1 ; and so if the range of values of x is from
x = a to x = b, the median value of x is defined by

    $\sum_a^x \phi(x) = \sum_x^b \phi(x) = \tfrac{1}{2}$,    (1)

or

    $\int_a^x \phi(x)\,dx = \int_x^b \phi(x)\,dx = \tfrac{1}{2}\int_a^b \phi(x)\,dx = \tfrac{1}{2}$.    (2)

For a continuous probability curve the median ordinate, 
by (2), bisects the area under the curve. 

Arithmetic Mean. The most widely used typical
measure is the arithmetic mean, which is simply the
first moment or mathematical expectation of x, namely

    $\mu'_1 = \sum x\,\phi(x)$ or $\int x\,\phi(x)\,dx$.    (3)

These formulae are the same as those occurring in
dynamics for the centroid of a series of particles of masses
$\phi(x_j)$ placed at points $x_j$ along a straight line, and the
centroid of a straight rod of density $\phi(x)$ at the point x.
It follows that the arithmetic mean is the abscissa of the
ordinate through the centroid of the area under the curve
$y = \phi(x)$.

The arithmetic mean of the values $x_j$ in a sample is
correspondingly $m'_1 = \sum x f(x)$.

Remark . In many probability curves of slight or moderate 
skewness the median lies between the mode and the arithmetic 
mean, nearly twice as far from the mode as from the mean. 

Moments about the Mean. The arithmetic mean is 
so fundamental in theory and in practice that it is 
customary, once it has been determined, to take it as a 
new origin and to refer all higher moments to this origin. 
Moments about the mean as origin are usually denoted
by undashed $\mu_r$. We find easily, by binomial expansion,

    $\mu_r = \sum (x - \mu'_1)^r \phi(x)$ or $\int (x - \mu'_1)^r \phi(x)\,dx$
    $\quad = \mu'_r - r_{(1)}\,\mu'_1\,\mu'_{r-1} + r_{(2)}\,(\mu'_1)^2\,\mu'_{r-2} - \ldots$
    $\qquad + (-)^{r-1}\, r\,(\mu'_1)^{r-1}\,\mu'_1 + (-)^r (\mu'_1)^r$,    (4)

where $r_{(s)}$ denotes the familiar binomial coefficient
$r(r-1)\ldots(r-s+1)/s!$. The last two terms can be merged
into one as $(-)^{r-1}(r-1)(\mu'_1)^r$. For example :

    $\mu_1 = 0$ ;
    $\mu_2 = \mu'_2 - (\mu'_1)^2$ ;
    $\mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3$ ;
    $\mu_4 = \mu'_4 - 4\mu'_1\mu'_3 + 6(\mu'_1)^2\mu'_2 - 3(\mu'_1)^4$,    (5)

formulae of regular application in practical work, since
they hold equally well for moments of a frequency distribution,
$\phi(x)$ being then replaced by f(x), and $\mu$ by m.
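
Formulae (5) may be verified on any small distribution by computing the moments about the mean both directly and from the moments about the origin. The Python sketch below does so for an arbitrarily chosen skew distribution; the probability function phi and the helper names are illustrative assumptions.

```python
from fractions import Fraction

def raw_moment(phi, r):
    """r-th moment about the origin: sum of x^r phi(x)."""
    return sum(x ** r * p for x, p in phi.items())

def central_moment(phi, r):
    """r-th moment about the mean, computed directly from its definition."""
    mean = raw_moment(phi, 1)
    return sum((x - mean) ** r * p for x, p in phi.items())

# An arbitrary skew distribution on the values 0, 1 and 4.
phi = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}
m1, m2, m3, m4 = (raw_moment(phi, r) for r in range(1, 5))

# Moments about the mean by formulae (5).
mu2 = m2 - m1 ** 2
mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4

print(mu2, mu3, mu4)
print([central_moment(phi, r) for r in (2, 3, 4)])
```

The same functions apply unchanged to a frequency distribution if the relative frequencies f(x) are supplied in place of the probabilities.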

Other means, such as the geometric and the harmonic 
mean, are very occasionally used with respect to rather 
special distributions. 

Cumulants in Terms of Moments about the
Mean. By expanding the logarithm of

    $M(\alpha) = e^{\mu'_1\alpha}(1 + \mu_2\alpha^2/2! + \mu_3\alpha^3/3! + \ldots)$

as a series in powers of $\alpha$, and comparing the coefficients
of these with the coefficients in 10 (7), we find the relations
between cumulants and moments about the mean.
The first four relations are

    $\kappa_1 = \mu'_1,\quad \kappa_2 = \mu_2,\quad \kappa_3 = \mu_3,\quad \kappa_4 = \mu_4 - 3\mu_2^2$.

15. Measures of Dispersion or Spread. Distributions
differ according as the values of x are spread densely
or widely on either side of the mean. To describe this 
feature numerically we need parameters measuring 
dispersion. 

The arithmetic mean of the deviations $x_j - \mu'_1$ from the
mean is of course of no use for the purpose, being equal to
zero. A measure occasionally used, but now falling into
disuse, is the mean absolute deviation (the former name was
“ mean error ”) defined by the arithmetic mean of deviations
from the mean all taken with positive sign, namely

    $\sum |x - \mu'_1|\,\phi(x)$ or $\int |x - \mu'_1|\,\phi(x)\,dx$,    (1)

where $|x - \mu'_1|$ denotes the positive numerical, or absolute,
value of $x - \mu'_1$.

Though usually computed with respect to $\mu'_1$, it is
actually in closer association with the median, in virtue
of a certain minimal property, namely :

The median value of x is such that the sum of the absolute
deviations from it, $\sum_j |x - x_j|$, is a minimum.

The median of a discrete set of values $x_j$ needs more
precise definition. If an odd number of values is ranged
in monotonic order $x_0, x_1, \ldots, x_{2n}$ so that each $x_{j+1} \ge x_j$,
we shall define the median as the middle value, $x_n$. If
an even number of values is so arranged as $x_0, x_1, \ldots, x_{2n-1}$,
we shall say that the median is any value of x in
the middle interval, that is, is such that $x_{n-1} \le x \le x_n$.
The minimal property may then be proved as follows :

(a) Let there be 2n+1 values $x_0, x_1, \ldots, x_{2n}$, and let
us call the interval between $x_{j-1}$ and $x_j$ inclusive the jth
interval. The median is at $x = x_n$. Let us denote by
S(x) the sum $\sum |x - x_j|$ of absolute deviations from any x.

First consider S(x) as compared with $S(x_n)$, where x
is in the (n+1)th interval, on the right of the median,
and $x - x_n = h$. Then the absolute deviations of the n+1
values $x_0, x_1, \ldots, x_n$ on the one side have each been
increased by h, while those of the n values $x_{n+1}, x_{n+2}, \ldots, x_{2n}$
on the other have each been decreased by h.
Hence in this interval

    $S(x) - S(x_n) = h$.    (2)

Now suppose x moves into the next interval, the
(n+2)th. Comparing S(x) with $S(x_{n+1})$, we note that if
$x - x_{n+1} = h$ the absolute deviations of the n+2 values
$x_0, x_1, \ldots, x_{n+1}$ each receive an increment h, while those
of the remaining n-1 values receive a decrement h.
Hence in this interval

    $S(x) - S(x_{n+1}) = 3h$.    (3)

In this way S(x) increases as x moves through successive
intervals to the right, the increments which it receives
within the intervals being h, 3h, 5h, ..., (2n-1)h ; and
by symmetry, or by a similar proof, S(x) receives corresponding
increments as x moves through successive intervals
to the left of $x_n$.

Hence S(x) is a minimum for $x = x_n$.

(b) Let there be 2n values $x_0, x_1, \ldots, x_{2n-1}$.

The reader will see at once that if x lies in the central
interval, the nth interval, and if within that interval is
displaced by an amount h, then n absolute deviations on
the one side each receive an increment h, while n on the
other each receive a decrement h. Hence S(x) is constant
within the central interval.

Also, as x moves out of the central interval either to
right or to left through successive intervals, S(x) receives
the respective increments 2h, 4h, ..., (2n-2)h. Hence
S(x) is a minimum within the central interval.

(c) The result for a continuous variate x can be proved
as a limiting case of (a) and (b), or else directly thus :

Let (-a, b) be the range of values of x, the median being
taken as the origin x = 0, so that

    $\int_{-a}^{0} \phi(x)\,dx = \int_{0}^{b} \phi(x)\,dx = \tfrac{1}{2}$.    (4)

The integral S(h) of absolute deviations from x = h,
h > 0, is then

    $S(h) = \int_{-a}^{h} (h-x)\phi(x)\,dx + \int_{h}^{b} (x-h)\phi(x)\,dx$
    $\quad = \left[\int_{-a}^{0} + \int_{0}^{h}\right](h-x)\phi(x)\,dx + \left[\int_{0}^{b} - \int_{0}^{h}\right](x-h)\phi(x)\,dx$,    (5)

whereas

    $S(0) = \int_{-a}^{0} (-x)\phi(x)\,dx + \int_{0}^{b} x\,\phi(x)\,dx$.    (6)

Hence, the terms in h arising from the ranges (-a, 0) and
(0, b) cancelling by (4),

    $S(h) - S(0) = \int_{0}^{h} (h-x)\phi(x)\,dx - \int_{0}^{h} (x-h)\phi(x)\,dx$
    $\quad = 2\int_{0}^{h} (h-x)\phi(x)\,dx$,    (7)

and this is essentially positive, since $\phi(x)$ is a positive
function. The same result may be proved to hold for
h < 0, and so S(0) is a minimum.

Note . The indeterminacy of the median of an even number 
of discrete values matters exceedingly little in practice, 
the two middle values being for the most part indistinguishably 
close. 
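
The behaviour of the sum of absolute deviations described in (a) and (b) can be watched on two small sets of values. The Python sketch below is an illustration only; the sets odd and even and the function name sum_abs_dev are assumptions chosen for the purpose.

```python
def sum_abs_dev(values, x):
    """S(x): the sum of absolute deviations of the values from x."""
    return sum(abs(x - v) for v in values)

odd = [1, 2, 4, 7, 11]        # 2n+1 values with n = 2; the median is 4
even = [1, 2, 4, 7, 11, 12]   # 2n values with n = 3; central interval 4 to 7

# Case (a): S(x) is least at the median, and grows by h, then 3h, to the right.
grid = [k / 2 for k in range(0, 27)]           # 0, 0.5, ..., 13
best_odd = min(grid, key=lambda x: sum_abs_dev(odd, x))
step_1 = sum_abs_dev(odd, 5) - sum_abs_dev(odd, 4)    # h = 1 in the interval 4 to 7
step_2 = sum_abs_dev(odd, 8) - sum_abs_dev(odd, 7)    # h = 1 in the interval 7 to 11

# Case (b): S(x) is constant in the central interval, then grows by 2h.
central = [sum_abs_dev(even, x) for x in (4, 5, 6, 7)]
step_out = sum_abs_dev(even, 8) - sum_abs_dev(even, 7)

print(best_odd, step_1, step_2)
print(central, step_out)
```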


The Quartiles. The median ordinate halves the
distribution. Halving again the two halves, we may find
values of x which are called the quartile measures. For
discrete distributions, they lie one-quarter and three-quarters
along the line of values $x_j$, supposed arranged
in ascending order. For continuous distributions of range
x = a to x = b they are values $q_1$, $q_3$ (it is hardly worth
while here to press further Greek letters into service)
such that

    $\int_a^{q_1} \phi(x)\,dx = \int_{q_3}^{b} \phi(x)\,dx = \tfrac{1}{4}$.    (8)

The median might be regarded as a middle quartile $q_2$ ;
the other two are called the upper and lower quartiles.
The value of $\tfrac{1}{2}(q_3 - q_1)$ furnishes a measure of dispersion
called the semi-interquartile range. Any value of x has an
a priori probability of $\tfrac{1}{2}$ of being such that $q_1 < x < q_3$ ;
it is as likely to be inside the range as outside it. For
this reason, in the theory of errors, this particular measure
of dispersion has long been called the probable error of
the distribution. The name is very misleading, since there
is nothing specially probable about this particular deviation ;
and of late there has been a salutary tendency to
supersede the so-called probable error by the standard
deviation, which we now define.

Standard Deviation. The arithmetic mean of the
squared deviations $(x - \mu'_1)^2$ from the mean, that is, the
second moment $\mu_2$, is obviously a suitable measure of
dispersion. The square root of this, $\sqrt{\mu_2}$, formerly called
the root-mean-square deviation, is now called the standard
deviation and is denoted by $\sigma$. The sample value is
denoted by s. Thus $\sigma^2 = \mu_2$, $s^2 = m_2$.

Variance. Modern usage is tending more and more to
treat $\mu_2$ or $\sigma^2$ itself, rather than $\sigma$, as a suitable measure of
dispersion, under the name of variance. We have therefore

    $\sigma^2 = \sum (x - \mu'_1)^2 \phi(x)$ or $\int (x - \mu'_1)^2 \phi(x)\,dx$,    (9)

while

    $s^2 = \sum (x - m'_1)^2 f(x)$.    (10)

The standard deviation has also a minimal property,
with respect to the arithmetic mean, namely :

The sum or mean of squared deviations is a minimum
when taken with respect to the arithmetic mean.

This fact is obvious at once from the formula of 14 (5),

    $\mu_2 = \mu'_2 - (\mu'_1)^2$,

which shows that $\mu_2$ can never exceed $\mu'_2$, the mean of
squared deviations from an arbitrary origin.

Mean and Variance of Linear Function. If we
distinguish the respective means and variances of three
independent variates $x$, $y$, $z$ by triple suffixes, thus,
$\mu'_{100}$, $\mu'_{010}$, $\mu'_{001}$ and $\mu_{200}$, $\mu_{020}$, $\mu_{002}$, then from the properties
of cumulants (11) the linear function $ax+by+cz$ has

mean $a\mu'_{100} + b\mu'_{010} + c\mu'_{001}$,
variance $a^2\mu_{200} + b^2\mu_{020} + c^2\mu_{002}$,

and similarly for a general linear function in any number
of independent variates.

Range, Extremes. Other indications of the dispersion
of a distribution are given by the size of the range
of $x$ itself, $b-a$, as well as by the highest value, $b$, or
lowest value, $a$.


16. Measures of Asymmetry or Skewness. When
the mean is taken as origin $x = 0$, it may happen that
$\phi(x) = \phi(-x)$, so that the distribution is symmetrical.

Ex. 1. The distribution of number of heads in a throw of
$n$ symmetrical coins, described by the g.f. $(\tfrac12 + \tfrac12 t)^n$, is
symmetrical about $x = \tfrac12 n$.

Ex. 2. The continuous distribution described by

$$dp = \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 (x-a)^2}\,dx$$

is symmetrical about $x = a$.

Ex. 3. The distribution given by

$$dp = \frac{1}{\pi}\,\frac{dx}{1+x^2}$$

is symmetrical about $x = 0$.

Lack of symmetry, skewness, is revealed functionally
or numerically in various ways.

Various Measures of Skewness. In a symmetrical
distribution the distances of the quartiles $q_1$ and $q_3$ from
the median $q_2$ will be equal. In a skew distribution the
difference between these distances gives a coefficient of
skewness, namely

$$\{(q_3 - q_2) - (q_2 - q_1)\}/\sigma = (q_3 - 2q_2 + q_1)/\sigma,$$

the division by $\sigma$ being for the purpose of removing
arbitrary units of scale and obtaining an absolute coefficient.

A natural measure of skewness is however the third
moment about the mean, $\mu_3$. If the distribution is symmetrical,
$\mu_3 = 0$. If the long tail of the distribution is
on the side of the positive values of $x$, the cubes of positive
values of $x$ outweigh the cubes of negative values, so that
$\mu_3$ is positive, and we have positive skewness. In the
same way if the long tail of the curve is on the side of
the negative values of $x$, then $\mu_3$ is negative, and we have
negative skewness.

To remove arbitrary units of measure, since $\mu_3$ is of
the dimensions of $x^3$, or of $\sigma^3$, we construct an absolute
measure of skewness by dividing by $\sigma^3$, that is by $\mu_2^{3/2}$.
The square of this, $\mu_3^2/\mu_2^3$, is often denoted by $\beta_1$.

Another measure of skewness (due to K. Pearson) 
depends on the fact that in a skew curve the mean, median 
and mode are not the same. The measure in question is 
defined by 

(Mean—Mode)/(Standard Deviation). 

Like $\mu_3$ it is positive for positive skewness, zero for
symmetry, negative for negative skewness.

Skewness of Linear Function. If the 3rd moments
of independent variates $x$, $y$, $z$ about their means are $\mu_{300}$,
$\mu_{030}$, $\mu_{003}$ respectively, the 3rd moment of $ax+by+cz$ about
its mean is

$$a^3\mu_{300} + b^3\mu_{030} + c^3\mu_{003},$$

and similarly for linear functions of any number of independent
variates. This follows (14) because $\mu_3 = \kappa_3$.

17. Measure of Flattening or Excess, Kurtosis.
Two distributions may have the same mean, the same
standard deviation, the same skewness, and yet may
differ in that the curve of the one may be more flattened
at the centre (platykurtic) than that of the other.

The degree of flattening is suitably measured by the
4th moment about the mean, $\mu_4$. Removing arbitrary
units of measure, just as in the case of $\beta_1$, we obtain the
coefficient $\mu_4/\mu_2^2$, often denoted by $\beta_2$. It has been observed
in an extensive class of probability curves, with scale chosen
so that the variance is unity, that the ordinate at the
mean or mode is greater or less according as $\beta_2$ itself is
greater or less. Thus the value of $\beta_2$ serves to indicate
whether the curve is tall and slim at the centre (leptokurtic)
or squat (platykurtic). In the very important normal
probability curve, which we shall meet in 32, the value of
$\beta_2$ is 3. Hence $\beta_2 - 3$ is sometimes called the excess, curves
for which $\beta_2 < 3$ being platykurtic, those for which $\beta_2 > 3$
being leptokurtic, the normal curve being taken as
standard.

Higher Moments. No simple geometrical interpretation
attaches to parameters expressed by moments
$\mu_r$ or $m_r$ higher than the 4th, except of course that the
moments of even order might be regarded as further
measures of dispersion, and those of odd order as further
measures of skewness. These higher moments are in any
case very seldom used in practice for frequency distributions,
because being computed from values of $x$ liable to random
irregularity, "error" as it is usually called, they may be
subject to very great error owing to the raising of some
abnormally frequent large deviation $x$ to a high power.
This will be apparent when we come to consider the
sampling error of coefficients, in Chapter VII.

18. Practical Computation of Moments. The
initial stages of the analysis of frequency distributions
almost always involve the computation of ordinary or
factorial moments. In the case of a continuous variate
artificially grouped (13) into classes, a certain error is
introduced into the moments by the centring of class-frequencies
about the centre of the class. The calculated
moments then require adjustment by formulae of rather
wide application called Sheppard's Corrections.

The example below shows the computation of the
first four moments and the coefficients of dispersion and
excess, for a frequency distribution. The column headings
explain themselves. It will be observed that transference
is made to the more convenient provisional mean x = 67,
this being judged by inspection of the distribution to be
somewhere near the true mean.

Sheppard's corrections have not been used; we shall
allude to this example when we come to discuss them.
As for the mean height of the group, the provisional origin
is really, as we saw earlier, 67 7/16, or 67.44 inches. Hence
the mean height is 67.44 + 0.34 = 67.78 inches.

The distribution shows a slight negative skewness.
Whether this is a genuine effect or due to the irregularities
of sampling cannot be decided until we know more about
the probability distributions of coefficients calculated from
samples.

The reader should verify that the sample estimates of
$\beta_1$ and $\beta_2$ are 0.0014 and 3.56.

Example. The distribution of heights of adult Irishmen.
(X is the height in inches, x = X - 67 the deviation from the
provisional mean. Bracketed figures are the partial sums of the
negative and positive entries of a column.)

    X      nf     x       nfx     nfx^2     nfx^3     nfx^4
   59       1    -8        -8        64      -512      4096
   60       0    -7         0         0         0         0
   61       2    -6       -12        72      -432      2592
   62       2    -5       -10        50      -250      1250
   63       7    -4       -28       112      -448      1792
   64      15    -3       -45       135      -405      1215
   65      33    -2       -66       132      -264       528
   66      58    -1       -58        58       -58        58
   67      73     0    (-227) 0       0   (-2369) 0       0
   68      62     1        62        62        62        62
   69      40     2        80       160       320       640
   70      25     3        75       225       675      2025
   71      15     4        60       240       960      3840
   72      10     5        50       250      1250      6250
   73       3     6        18       108       648      3888

  Total   346           (345)               (3915)
                          118      1668      1546     28236
  Total/346             0.341     4.821     4.468     81.61

$m'_1 = 0.341 + 67$.

$m_2 = 4.821 - (0.341)^2 = 4.705$.

$m_3 = 4.468 - 3(0.341)(4.821) + 2(0.341)^3 = -0.385$.

$m_4 = 81.61 - 4(0.341)(4.468) + 6(0.341)^2(4.821) - 3(0.341)^4 = 78.84$.
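The arithmetic of this example may be checked mechanically. The following is a minimal sketch in Python, not part of the original computation: the function names `moments_about_origin` and `moments_about_mean` are illustrative choices of our own, and the frequencies are those of the table above with the provisional origin at 67.

```python
# Frequency table of heights (inches) from the example; origin 67 as in the text.
heights = list(range(59, 74))
freqs = [1, 0, 2, 2, 7, 15, 33, 58, 73, 62, 40, 25, 15, 10, 3]
ORIGIN = 67


def moments_about_origin(values, freqs, origin, max_order=4):
    """Return [m'_0, m'_1, ..., m'_max_order] about the provisional origin."""
    n = sum(freqs)
    return [
        sum(f * (v - origin) ** r for v, f in zip(values, freqs)) / n
        for r in range(max_order + 1)
    ]


def moments_about_mean(m):
    """Transfer m'_1..m'_4 (about the origin) to m_2, m_3, m_4 about the mean."""
    m1, m2, m3, m4 = m[1], m[2], m[3], m[4]
    c2 = m2 - m1 ** 2
    c3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    c4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return c2, c3, c4


raw = moments_about_origin(heights, freqs, ORIGIN)
m2, m3, m4 = moments_about_mean(raw)
beta1 = m3 ** 2 / m2 ** 3   # squared coefficient of skewness
beta2 = m4 / m2 ** 2        # coefficient of kurtosis

print(f"m'_1 = {raw[1]:.3f}, m_2 = {m2:.3f}, m_3 = {m3:.3f}, m_4 = {m4:.2f}")
print(f"beta_1 = {beta1:.4f}, beta_2 = {beta2:.2f}")
```

Because the sketch carries full floating-point precision instead of three-figure intermediate values, its results may differ from the hand computation in the last printed digit.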


19. Computation of Moments by Repeated Summation.
If the origin of a distribution be taken at either
end, preferably at the lower end, factorial moments can
be computed by a process of repeated summation. We
sum frequencies in columns from the remote value of x
towards the origin, in the manner exemplified below.
The leading sum in each column is one step lower than the
leading sum in the preceding column.

Ex. 1. The same distribution, with origin at x = 59
(the first column gives x measured from 59).

     x     nf      Σ      Σ²      Σ³      Σ⁴      Σ⁵
     0      1    346
     1      0    345   2886
     2      2    345   2541   11407
     3      2    343   2196    8866   28343
     4      7    341   1853    6670   19477   49757
     5     15    334   1512    4817   12807   30280
     6     33    319   1178    3305    7990   17473
     7     58    286    859    2127    4685    9483
     8     73    228    573    1268    2558    4798
     9     62    155    345     695    1290    2240
    10     40     93    190     350     595     950
    11     25     53     97     160     245     355
    12     15     28     44      63      85     110
    13     10     13     16      19      22      25
    14      3      3      3       3       3       3

    r!             1      1       2       6      24

The successive sums at the heads of columns may be proved
(Appendix 2) to be equal to $nm'_{(r)}/r!$. We have therefore

$n = 346$, $nm'_{(1)} = 2886$, $nm'_{(2)} = 22814$, $nm'_{(3)} = 170058$,
$nm'_{(4)} = 1194168$.

Transforming to ordinary moments $m'$ by the relations
(Appendix 3)

$$m'_1 = m'_{(1)},$$
$$m'_2 = m'_{(2)} + m'_{(1)}, \tag{1}$$
$$m'_3 = m'_{(3)} + 3m'_{(2)} + m'_{(1)},$$
$$m'_4 = m'_{(4)} + 6m'_{(3)} + 7m'_{(2)} + m'_{(1)},$$

we obtain

$m'_1 = 2886/346 = 8.34104$,
$m'_2 = 25700/346 = 74.2775$,
$m'_3 = 241386/346 = 697.647$,
$m'_4 = 2377100/346 = 6870.23$,

from which, by adjusting (14 (5)) to moments about the
mean, we derive

$m_2 = 4.705$, $m_3 = -0.386$, $m_4 = 78.87$.

Now it may be noted that some advantage has been
lost through the large numbers that arise in summations
from end to end. Even though six significant digits have
been retained throughout, the final results are very slightly
discrepant with those computed by the other method. To
obviate these disadvantages (which are not serious when a
calculating machine is available) one may use either (i) factorial
moments obtained by summation from both ends
towards an origin near the centre, or (ii) central and mean
central factorial moments obtained by a slight modification
of this summation.
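The process of repeated summation from one end, as in Ex. 1 with origin at x = 59, lends itself to a short program. What follows is a sketch only, under the assumption that the classes are equally spaced with the origin at the first class; the names `factorial_moment_totals` and `ordinary_from_factorial` are hypothetical and are introduced here merely for illustration.

```python
from math import factorial

# Frequencies for x = 0, 1, ..., 14, the origin being at 59 inches.
freqs = [1, 0, 2, 2, 7, 15, 33, 58, 73, 62, 40, 25, 15, 10, 3]


def factorial_moment_totals(freqs, max_order=4):
    """Return [n*m'_(0), n*m'_(1), ...] by repeated summation towards the origin."""
    column = list(freqs)
    totals = []
    for r in range(max_order + 1):
        # Cumulative sums from the remote end of the column towards the origin.
        running, sums = 0, []
        for value in reversed(column):
            running += value
            sums.append(running)
        sums.reverse()
        # The leading sum equals n*m'_(r)/r!, so multiply by r!.
        totals.append(sums[0] * factorial(r))
        # The next column begins one step lower than this one.
        column = sums[1:]
    return totals


def ordinary_from_factorial(totals):
    """Apply the relations between ordinary and factorial moments to the totals."""
    _, f1, f2, f3, f4 = totals
    return [f1, f2 + f1, f3 + 3 * f2 + f1, f4 + 6 * f3 + 7 * f2 + f1]


totals = factorial_moment_totals(freqs)
power_sums = ordinary_from_factorial(totals)   # n*m'_1, ..., n*m'_4
n = totals[0]
print(totals)
print([s / n for s in power_sums])
```

Since the sums are held as exact integers, no advantage is lost through the size of the numbers, and the division by n is deferred to the last step.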

Ex. 2. Ordinary factorial moments. Origin at x = 67.
(The first column gives x measured from 67; the leading entries
used below, italicized in the original, are marked *.)

     x     nf      Σ      Σ²      Σ³      Σ⁴      Σ⁵
    -8      1      1      1       1       1       1
    -7      0      1      2       3       4       5
    -6      2      3      5       8      12      17
    -5      2      5     10      18      30      47
    -4      7     12     22      40      70     117
    -3     15     27     49      89     159     276
    -2     33     60    109     198     357     633
    -1     58    118*   227*    425*    782*   1415*
     0     73    228*
     1     62    155    345*
     2     40     93    190     350*
     3     25     53     97     160     245*
     4     15     28     44      63      85     110*
     5     10     13     16      19      22      25
     6      3      3      3       3       3       3

    r!             1      1       2       6      24

From the marked entries we obtain

$n = 228 + 118 = 346$,
$nm'_{(1)} = 345 - 227 = 118$,
$nm'_{(2)} = (350 + 425)\,2 = 1550$,
$nm'_{(3)} = (245 - 782)\,6 = -3222$,
$nm'_{(4)} = (110 + 1415)\,24 = 36600$.

These may be transformed to ordinary moments $m'$ by
the same relations as before, yielding

$m'_1 = 118/346 = 0.341$, $m'_2 = 1668/346 = 4.821$,
$m'_3 = 1546/346 = 4.468$, $m'_4 = 28236/346 = 81.61$.

These are the same values as were found by the first method.

Ex. 3. Central and mean central factorial moments, with
the same origin x = 67.

Here we again sum towards the centre from the ends, but
each alternate sum (shown bracketed) involves
the adding of only half the last summand in the preceding
column, while the last sums in the other columns step
successively away from the centre, as shown. (The first column
gives x measured from 67; unbracketed leading entries used below
are marked *.)

     x     nf      Σ        Σ²      Σ³        Σ⁴      Σ⁵
    -8      1      1        1       1         1       1
    -7      0      1        2       3         4       5
    -6      2      3        5       8        12      17
    -5      2      5       10      18        30      47
    -4      7     12       22      40        70     117
    -3     15     27       49      89       159     276
    -2     33     60      109     198       357*   (454.5)
    -1     58    118      227*   (311.5)
               (154.5)
     0     73
               (191.5)
     1     62    155      345*   (522.5)
     2     40     93      190     350       595*   (652.5)
     3     25     53       97     160       245     355
     4     15     28       44      63        85     110
     5     10     13       16      19        22      25
     6      3      3        3       3         3       3

  Total   346
    r!             1        1       2         6      24

From the bracketed and marked entries we obtain the central factorial
moments

$n = 191.5 + 154.5 = 346$,
$nm'_{\{1\}} = 345 - 227 = 118$,
$nm'_{\{2\}} = (522.5 + 311.5)\,2 = 1668$,
$nm'_{\{3\}} = (595 - 357)\,6 = 1428$,
$nm'_{\{4\}} = (652.5 + 454.5)\,24 = 26568$.

The formulas for the $m'_r$ in terms of the $m'_{\{r\}}$ are rather
simple (Appendix 3). We have

$m'_1 = m'_{\{1\}} = 118/346 = 0.341$,
$m'_2 = m'_{\{2\}} = 1668/346 = 4.821$,
$m'_3 = m'_{\{3\}} + m'_{\{1\}} = (1428 + 118)/346 = 4.468$,
$m'_4 = m'_{\{4\}} + m'_{\{2\}} = (26568 + 1668)/346 = 81.61$,

as before. The moments about the mean can now be found.
Alternatively, we could easily derive formulae transforming the
moments and transferring them to the mean in one step.

20. Sheppard's Corrections for Grouped Moments.
As mentioned earlier, when a continuous distribution has
been grouped into centred classes for convenience, the
moments require adjustment or correction because of this
artificial grouping. The necessary formulae of correction
were found by W. F. Sheppard.

Naturally the problem for perfectly general functions
$\phi(x)$ is too broad, and it is necessary to impose conditions.
Sheppard considered the case where $\phi(x)$ was such that
the derivatives $\phi'(x)$, $\phi''(x)$, ... vanished in succession at
the boundaries $x = a$ and $x = b$ to such an order that

$$\int_a^b x^r \phi^{(s)}(x)\,dx = w\sum_j x_j^r\,\phi^{(s)}(x_j) \tag{1}$$

to a sufficient degree of accuracy, where $w$ is the class-breadth
and $x_j$ the centre of a typical class; that is to
say, the error committed should be negligible compared
with sampling errors.

Remark. The relation between an integral and a sum of
equidistant ordinates of the kind here considered enters into
pure mathematics in the Euler-Maclaurin summation formula,
by which a sum of ordinates is expressed as an integral over
the range plus corrective terms involving the derivatives of
odd order taken at the boundaries. In many cases, where
these derivatives are not absolutely zero but
converge to zero as a limit, the representation of the integral
on the left of (1) by the sum on the right needs very careful
investigation. It is found, however, that for the statistical
functions to which Sheppard's corrections are usually applied
the difference between the integral and the sum can be made
negligibly small by taking values of the class-interval w of a
size quite customary in practice. Usually it is enough for w
to be less than the standard deviation. The derivation of
the formulae given below must be regarded as approximate
only.

Ex. 1. The following two comparisons of integral with
sum over an infinite range are interesting in this respect:

$$\int_{-\infty}^{\infty} \frac{dx}{1+x^2} = \pi = 3.14159 \text{ nearly},$$

whereas

$$\sum_{-\infty}^{\infty} \frac{1}{1+x^2} = \pi\coth\pi = 3.153348\ldots,$$

$x$ taking the values 0, ±1, ±2, ... . Again,

$$\int_{-\infty}^{\infty} e^{-\frac12 x^2}\,dx = \sqrt{2\pi} = 2.506628275 \text{ nearly},$$

whereas

$$\sum_{-\infty}^{\infty} e^{-\frac12 x^2} = 2.506628288 \text{ nearly},$$

$x$ taking the same values as before. The first sum in these
examples is only moderately close to the corresponding
integral; the second is very close, and still closer results are
obtained if a summation with a finer subdivision of x is used.

Suppose the range $b-a$ divided into $n$ class-intervals
$(x_j - \tfrac12 w,\; x_j + \tfrac12 w)$, so that $b - a = nw$. If the probability in
the $j$th class is

$$p_j = \int_{x_j-\frac12 w}^{x_j+\frac12 w} \phi(x)\,dx, \tag{2}$$

then the $r$th moment calculated from the grouped classes is

$$\bar\mu'_r = \sum_{j=1}^{n} x_j^r\, p_j, \tag{3}$$

whereas the true moment is

$$\mu'_r = \int_a^b x^r \phi(x)\,dx. \tag{4}$$

Now

$$p_j = \int_{x_j-\frac12 w}^{x_j+\frac12 w} \phi(x)\,dx = \int_{-\frac12 w}^{\frac12 w} \phi(x_j+x)\,dx$$

$$= \int_{-\frac12 w}^{\frac12 w} \{\phi(x_j) + x\phi'(x_j) + x^2\phi''(x_j)/2! + \ldots\}\,dx$$

$$= w\phi(x_j) + w^3\phi''(x_j)/24 + w^5\phi^{\mathrm{iv}}(x_j)/1920 + \ldots, \tag{5}$$

provided that this series in powers of $w$ converges.

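Hence

$$\bar\mu'_r = \sum_j x_j^r\, p_j = \int x^r\phi(x)\,dx + \frac{w^2}{24}\int x^r\phi''(x)\,dx + \frac{w^4}{1920}\int x^r\phi^{\mathrm{iv}}(x)\,dx + \ldots \tag{6}$$

in view of (1). Integrating by parts and using the fact
that derivatives vanish at the boundaries, we have

$$\bar\mu'_r = \mu'_r + \frac{w^2}{24}\,r(r-1)\,\mu'_{r-2} + \frac{w^4}{1920}\,r(r-1)(r-2)(r-3)\,\mu'_{r-4} + \ldots \tag{7}$$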


If moments $\mu_r$ about the mean are taken, we have
therefore the relations (where $\bar\mu_r$ means the $r$th moment of
the grouped classes about the mean):

$$\bar\mu_0 = \mu_0 = 1,$$
$$\bar\mu_1 = \mu_1 = 0,$$
$$\bar\mu_2 = \mu_2 + \tfrac{1}{12} w^2 \mu_0 = \mu_2 + \tfrac{1}{12} w^2, \tag{8}$$
$$\bar\mu_3 = \mu_3 + \tfrac14 w^2 \mu_1 = \mu_3,$$
$$\bar\mu_4 = \mu_4 + \tfrac12 w^2 \mu_2 + \tfrac{1}{80} w^4,$$

which, on being solved for the $\mu_r$, yield

$$\mu_1 = \bar\mu_1 = 0,$$
$$\mu_2 = \bar\mu_2 - \tfrac{1}{12} w^2, \tag{9}$$
$$\mu_3 = \bar\mu_3,$$
$$\mu_4 = \bar\mu_4 - \tfrac12 w^2 \bar\mu_2 + \tfrac{7}{240} w^4,$$

and these are the required adjustments, Sheppard's
corrections. The correction to the second moment is
especially simple and noteworthy. If the class-interval
is taken as the unit of scale, the correction amounts to
subtracting $\tfrac{1}{12}$ from the grouped second moment.

It is customary, though the practice requires more
justification than it has ever received, to apply the same
corrections for grouping in the case of frequency distributions,
the presumption being that the moments thus
corrected are a better representation of the moments of
the underlying probability distribution.

Ex. 2. Correcting the moments about the mean for
grouping in the example of 18 and 19, we obtain for the
corrected moments

$m_2 = 4.705 - 0.083 = 4.622$,
$m_3 = -0.385$,
$m_4 = 78.84 - \tfrac12(4.705) + 0.029 = 76.52$.
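Sheppard's corrections for the second, third and fourth moments about the mean are easily expressed as a small function. The following is a sketch under the assumptions of this section (centred classes of equal breadth w); the name `sheppard_correct` is our own, and the figures fed to it are the uncorrected moments of the example of 18, with the class-interval of one inch as unit.

```python
def sheppard_correct(m2, m3, m4, w=1.0):
    """Apply Sheppard's corrections to grouped moments about the mean.

    m2, m3, m4 are the grouped (uncorrected) moments and w the class-breadth.
    """
    mu2 = m2 - w ** 2 / 12
    mu3 = m3                                   # the third moment needs no correction
    mu4 = m4 - 0.5 * w ** 2 * m2 + 7 * w ** 4 / 240
    return mu2, mu3, mu4


mu2, mu3, mu4 = sheppard_correct(4.705, -0.385, 78.84, w=1.0)
print(f"corrected m_2 = {mu2:.3f}, m_3 = {mu3:.3f}, m_4 = {mu4:.2f}")
```

Note that the correction to the fourth moment uses the uncorrected second moment, in agreement with the solved relations for the true moments in terms of the grouped ones.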

Ex. 3. The reader should seek out for himself numerous
examples of frequency distributions, and should acquire as
much practice as possible in computing moments in the various
ways exemplified above, and in correcting them. Sheppard's
correction will be applied in those cases in which the relative
frequency f(x) in the sample corresponds to the probability $\phi(x)$ of a
continuous variate.

SPECIAL PROBABILITY DISTRIBUTIONS

21. Distributions of Equal Probability. If n values $x_j$
of x, where j = 1, 2, ..., n, have each equal probability
1/n, the graph of probability consists of n ordinates of
equal height 1/n. The case of a symmetrical coin is the
case n = 2, the case of an ordinary unbiassed die is the
case n = 6.

The Rectangular Distribution. The limiting case
of the preceding, when n tends to infinity, yields an
important distribution called the rectangular distribution,
namely that in which x has an equal probability of being
at any point in the range $x = a$ to $x = b$, $a < b$. The
probability differential is then given by

$$dp = \frac{dx}{b-a}, \tag{1}$$

so that $\phi(x) = 1/(b-a)$ and the probability curve consists
of a rectangle on the range as base and of height $1/(b-a)$.
It is always possible to choose the central point of the
range for origin, and the unit of scale such that the range
becomes the new range $x = -\tfrac12$ to $x = \tfrac12$. The rectangle
is then a square. The moments of odd order vanish;
those of even order are

$$\mu_r = \int_{-\frac12}^{\frac12} x^r\,dx = \frac{1}{r+1}\left(\tfrac12\right)^r. \tag{2}$$

In particular $\mu_2 = \tfrac{1}{12}$, so that $\sigma = 1/\sqrt{12} = 0.2886\ldots$

Example 1. Show that the m.g.f. of the standard rectangular
distribution is $(\sinh \tfrac12\alpha)/\tfrac12\alpha$.

Example 2. The following samples from a rectangular
population have been arranged as frequency distributions.
The times on 1000 watches displayed in watchmakers'
windows were noted by the author. The distributions are
of the first and second 500 of these. Class x = 1 means
the class of all watch times from 1 h. to 1 h. 59 m. to the
nearest minute, and classes x = 2, 3, ... 12 have a similar
meaning.

       x      1   2   3   4   5   6   7   8   9  10  11  12     n
 (i)   nf    34  54  39  49  45  41  33  37  41  47  39  41   500
 (ii)  nf    47  41  47  49  45  32  37  40  41  37  48  36   500

The mathematical expectation of the number in any class
is 500/12, or 42 to the nearest integer. One of the classes in
the above samples contains 54, and another contains 32. We
shall see later that the deviations here are not extreme.

The mathematical expectation, or mean of x in the
population, is 6.5. The means of x in the above samples
are 6.426 and 6.322.

22. The Binomial Distribution. This fundamental
distribution arises when n trials are made of a constant
system S with probability p of an event E, the number x
of successes in the n trials being the variate. The g.f.
is $(pt+q)^n$, and so by binomial expansion the probability
function, namely the coefficient of $t^x$ in the g.f., is

$$\phi(x) = n_{(x)}\, p^x q^{n-x}. \tag{1}$$

Moments. The f.m.g.f., obtained by putting $t = 1 + \alpha$,
is seen at once to be $(1+p\alpha)^n$, so that the mean, the
coefficient of $\alpha$ in this, is $np$. Hence the f.m.g.f. about
the mean is (11)

$$(1+\alpha)^{-np}(1+p\alpha)^n$$

$$= [(1+p\alpha)(1 - p\alpha + p(p+1)\alpha^2/2! - p(p+1)(p+2)\alpha^3/3! + \ldots)]^n$$

$$= [1 + p(1-p)\alpha^2/2! - 2p(1-p)(p+1)\alpha^3/3! + \ldots]^n, \tag{2}$$

whence $\mu_{(2)} = npq$, $\mu_{(3)} = -2npq(p+1)$, so that

$$\mu_2 = \mu_{(2)} = npq, \qquad \mu_3 = \mu_{(3)} + 3\mu_{(2)} = npq(q-p). \tag{3}$$

It is readily proved in the same way, by finding $\mu_{(4)}$ and
hence $\mu_4$, that

$$\mu_4 = npq\,[p^2 + (3n-4)pq + q^2]. \tag{4}$$

The formula $\sigma = \sqrt{npq}$ is of fundamental importance.

Example. The following is a sample from a binomial
population. The Swedish astronomer and statistician, C. V. L.
Charlier, performed 1000 times the experiment of drawing 10
cards, one at a time with replacement after each drawing,
from an ordinary pack, the number x of black cards in each
set of 10 cards being the variate. Thus n = 10, p = ½. He
obtained the distribution

    x      0    1    2    3    4    5    6    7    8    9   10   Total
    Nf     3   10   43  116  221  247  202  115   34    9    0    1000

The corresponding probability distribution has g.f.
$(\tfrac12 + \tfrac12 t)^{10}$. Multiplying this by 1000 and recording the
coefficients of powers of t to the nearest integer, we obtain

    x      0    1    2    3    4    5    6    7    8    9   10   Total
    Nφ     1   10   44  117  205  246  205  117   44   10    1    1000

From Charlier's data we find $m'_1 = 4.933$, $m_2 = 2.415$.
The theoretical expectations are $\mu'_1 = np = 5.00$ and
$\mu_2 = npq = 2.5$.

We shall consider in a later section (55) whether these
deviations of actual experimental results from theoretical
expectation are reasonable under the hypothesis of random
sampling.
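The theoretical frequencies and moments of the binomial distribution can be verified directly from the probability function. The sketch below is illustrative only; the function `binomial_central_moments` is a hypothetical helper of our own, and it compares moments computed term by term with the closed formulae for $\mu_2$, $\mu_3$ and $\mu_4$ of section 22.

```python
from math import comb


def binomial_pmf(n, p):
    """Probabilities of x = 0..n successes in n trials of constant probability p."""
    q = 1 - p
    return [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]


def binomial_central_moments(n, p):
    """Mean and 2nd, 3rd, 4th moments about the mean, computed term by term."""
    pmf = binomial_pmf(n, p)
    mean = sum(x * w for x, w in enumerate(pmf))
    central = [sum((x - mean) ** r * w for x, w in enumerate(pmf)) for r in (2, 3, 4)]
    return mean, central


# Charlier's experiment: n = 10, p = 1/2, repeated 1000 times.
expected = [round(1000 * w) for w in binomial_pmf(10, 0.5)]
print(expected)

# Check the closed formulae on an unsymmetrical case.
n, p = 7, 0.3
q = 1 - p
mean, (mu2, mu3, mu4) = binomial_central_moments(n, p)
print(mean, n * p)
print(mu2, n * p * q)
print(mu3, n * p * q * (q - p))
print(mu4, n * p * q * (p ** 2 + (3 * n - 4) * p * q + q ** 2))
```

An unsymmetrical case is used for the check because with p = ½ the third moment vanishes and the test of formula (3) would be trivial.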

23. The Binomial Distribution of Poisson. The
ordinary binomial distribution is often called the
Bernoullian distribution, after James Bernoulli, who first
(in Ars Conjectandi, a work published in 1713, eight years
after his death) investigated it in detail. S. D. Poisson
in 1837 considered the problem of n trials, but with the
system S varied each time so as to produce possibly
different probabilities of success $p_j$, where j = 1, 2, ..., n.
The g.f. is therefore

$$(p_1 t + q_1)(p_2 t + q_2) \ldots (p_n t + q_n), \tag{1}$$

and so the f.m.g.f. is

$$(1+p_1\alpha)(1+p_2\alpha) \ldots (1+p_n\alpha). \tag{2}$$

The coefficient of $\alpha$ in this f.m.g.f. gives us the mean, or
mathematical expectation of the number of successes, as
$p_1 + p_2 + \ldots + p_n$. Let us write

$$np = p_1 + p_2 + \ldots + p_n, \tag{3}$$

in order that we may later compare the moments with
those of a Bernoullian distribution with the same mean
probability $p$, and so characterized by the g.f. $(pt+q)^n$.
The Poisson f.m.g.f. about the mean is (compare the details
of 22 (2))

$$\prod_j (1+\alpha)^{-p_j}(1+p_j\alpha) \qquad (\textstyle\prod = \text{product})$$

$$= \prod_j [1 + p_j q_j\alpha^2/2! - 2p_j q_j(p_j+1)\alpha^3/3! + \ldots]$$

$$= 1 + \sum p_j q_j\,\alpha^2/2! - 2\sum p_j q_j(p_j+1)\,\alpha^3/3! + \ldots. \tag{4}$$

Hence $\mu_{(2)} = \mu_2 = \sum p_j q_j$, and $\mu_{(3)} = -2\sum p_j q_j(p_j+1)$,
so that

$$\mu_3 = \mu_{(3)} + 3\mu_{(2)} = \sum p_j q_j (q_j - p_j). \tag{5}$$

24. Comparison of Bernoullian and Poissonian
Variance. It will now be proved that the Poissonian
variance, let us say $\sigma_P^2$, is less than the Bernoullian, $\sigma_B^2$.
At first sight this may seem surprising, for one might
imagine that the variation of probability of success in
trials within the experiment would increase the variance
of x, the number of successes. If we consider, however,
the case of extreme variation of probability, namely the
case in which some of the trials are certain of success,
and the rest are certain of failure, we shall see that the
smaller variance is natural enough; for in this extreme
instance the value of x is constant and so its variance is
zero.

The fact that the Poissonian variance is less than the
Bernoullian is valuable, for it suggests a test for the
constancy or otherwise of the system S from one trial
to the next, in other words, for statistical homogeneity
within the experiment.

As in 23, let $p$ be the mean probability, $p = \sum p_j/n$.
We have at once, by the usual transference to the mean,

$$\sigma_p^2 = \sum (p_j - p)^2/n = \sum p_j^2/n - p^2, \tag{1}$$

where $\sigma_p^2$ is the variance of probability in the n trials.
Hence

$$\sum p_j q_j = \sum p_j(1-p_j) = np - np^2 - \sum (p_j - p)^2 = npq - \sum (p_j - p)^2, \tag{2}$$

that is,

$$\sigma_P^2 = \sigma_B^2 - n\sigma_p^2. \tag{3}$$

This result shows not only that the Poissonian variance
is less than the Bernoullian, but by how much it is less.
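The relation between the Poissonian and Bernoullian variances can be checked by multiplying out the g.f. of 23 numerically. The following is a sketch only: the function names and the particular probabilities 0.1, 0.3, ..., 0.9 are arbitrary illustrative choices of our own and are not taken from any data in the text.

```python
def poisson_binomial_pmf(ps):
    """Coefficients of t^x in the product of (p_j t + q_j), built one factor at a time."""
    pmf = [1.0]
    for p in ps:
        new = [0.0] * (len(pmf) + 1)
        for x, w in enumerate(pmf):
            new[x] += w * (1 - p)      # failure at this trial
            new[x + 1] += w * p        # success at this trial
        pmf = new
    return pmf


def variance(pmf):
    """Variance of the number of successes from its probability function."""
    mean = sum(x * w for x, w in enumerate(pmf))
    return sum((x - mean) ** 2 * w for x, w in enumerate(pmf))


ps = [0.1, 0.3, 0.5, 0.7, 0.9]                       # hypothetical probabilities
n = len(ps)
p_bar = sum(ps) / n                                  # mean probability p
var_p = sum((p - p_bar) ** 2 for p in ps) / n        # variance of probability
var_poisson = variance(poisson_binomial_pmf(ps))     # Poissonian variance
var_bernoulli = n * p_bar * (1 - p_bar)              # Bernoullian variance npq

print(var_poisson, var_bernoulli - n * var_p)
```

In the extreme case mentioned above, with every $p_j$ equal to 0 or 1, the same functions return a variance of zero.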

25. The Lexian Distribution. The extension made
by Poisson to the Bernoullian scheme consisted in varying
the probability of success among the n trials, but within
the experiment. A different kind of extension was considered
by the German economist, W. Lexis, in 1877.
The probability was taken by Lexis as constant in the
n trials of one experiment, but as varying among k such
experiments.

Let k Bernoullian sets of n repeated trials be made,
each with constant probability of success within the set.
Let $p_i$ be the probability for the $i$th set, where i = 1,
2, ..., k, and let $x_i$ be the number of successes recorded
in it. It is required to find the mean and variance of
the distribution of the $x_i$.

The sets are here mutually exclusive, and the probability
of each, if we imagine one of the $p_i$ to be chosen and n
trials to be then made, is 1/k. Also the f.m.g.f. of $x_i$ is
$(1+p_i\alpha)^n$. Thus the f.m.g.f. of the Lexian distribution is

$$k^{-1}\sum_i (1+p_i\alpha)^n. \tag{1}$$

The coefficient of $\alpha$ shows that the mean is $n\sum p_i/k$. For
comparison with a repeated Bernoullian scheme let us put
$np = n\sum p_i/k$. The f.m.g.f. about the mean is then

$$k^{-1}(1+\alpha)^{-np}\sum_i (1+p_i\alpha)^n \tag{2}$$

$$= [1 - np\alpha + np(np+1)\alpha^2/2! - \ldots] \times [1 + np\alpha + n(n-1)k^{-1}\textstyle\sum p_i^2\,\alpha^2/2! + \ldots],$$

whence, by picking out the coefficient of $\alpha^2/2!$,

$$\mu_2 = \mu_{(2)} = np(np+1) - 2n^2p^2 + n(n-1)k^{-1}[kp^2 + \textstyle\sum (p_i - p)^2]$$

$$= npq + n(n-1)\textstyle\sum (p_i - p)^2/k,$$

that is,

$$\sigma_L^2 = npq + n(n-1)\sigma_p^2. \tag{3}$$

Thus, whereas the Poissonian variance was less than
the Bernoullian, we see that the Lexian variance exceeds
the Bernoullian by an amount which increases strongly
with n, because of the coefficient $n(n-1)$ in (3).
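The Lexian variance formula may likewise be confirmed by forming the mixture of k binomial distributions explicitly. This is a minimal sketch with hypothetical figures (n = 20 and three sets with probabilities 0.4, 0.5, 0.6, chosen by us for illustration); `lexian_pmf` is a name of our own.

```python
from math import comb


def lexian_pmf(n, ps):
    """Probability function of x when one of the k probabilities in ps is
    chosen with chance 1/k and n Bernoullian trials are then made."""
    k = len(ps)
    return [
        sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for p in ps) / k
        for x in range(n + 1)
    ]


n, ps = 20, [0.4, 0.5, 0.6]            # hypothetical scheme
k = len(ps)
p_bar = sum(ps) / k
var_p = sum((p - p_bar) ** 2 for p in ps) / k

pmf = lexian_pmf(n, ps)
mean = sum(x * w for x, w in enumerate(pmf))
var_lexian = sum((x - mean) ** 2 * w for x, w in enumerate(pmf))
var_formula = n * p_bar * (1 - p_bar) + n * (n - 1) * var_p

print(mean, n * p_bar)
print(var_lexian, var_formula)
```

Increasing n in this sketch while keeping the probabilities fixed shows the systematic term $n(n-1)\sigma_p^2$ rapidly overtaking the Bernoullian term $npq$.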

26. Coolidge's Extension of the Lexian Scheme.
It is a natural extension to consider, as J. L. Coolidge
did in 1921, the distribution which arises not from k
Bernoullian but from k Poissonian sets, each with a different
set of probabilities in its constituent n trials.

Let $p_{ij}$ be the probability of success in the $j$th trial of
the $i$th set. Then, just as in 25, the f.m.g.f. is

$$k^{-1}\sum_i \prod_j (1 + p_{ij}\alpha). \tag{1}$$

Let us write $\sum_j p_{ij} = np_{i0}$, $\sum_i p_{i0} = kp$. Then the mean of
the distribution is evidently $np$. Transferring the f.m.g.f.
to the mean, and picking out the coefficient of $\alpha^2/2!$,
we find, after three or four lines of algebra,

$$\mu_2 = \mu_{(2)} = npq + n(n-1)\sum_i (p_{i0} - p)^2/k - \sum_i\sum_j (p_{ij} - p_{i0})^2/k.$$

It is appropriate to regard the three terms of this
expression as of Bernoullian, Lexian and Poissonian type
respectively. Certain special cases are easily perceived;
for example, when $p_{i0} = p$, that is to say, when the mean
probability in each set of trials is the same for all sets,
a variance emerges which slightly generalises the Poissonian
variance $\sigma_P^2$, and, like it, is less than the Bernoullian.

An alternative form is

$$\mu_2 = npq + n^2\sum_i (p_{i0} - p)^2/k - \sum_i\sum_j (p_{ij} - p)^2/k,$$

which we may write as

$$\sigma_C^2 = \sigma_B^2 + n^2\sigma_{p_0}^2 - n\sigma_p^2, \tag{2}$$

where $\sigma_{p_0}^2$ denotes the variance of the set means $p_{i0}$ about $p$, and
$\sigma_p^2$ the variance of all the $nk$ probabilities $p_{ij}$ about $p$.

This result shows that non-homogeneity, or fluctuation
of probability, within the trials of an experiment is of far
less effect, when n is large, than fluctuation in mean
probability from one set to another. In fact in many
cases $\sigma_C^2$ differs only slightly from the corresponding $\sigma_L^2$.

Analysis of Variance. The results which we have
obtained for the Lexian and Coolidge schemes exhibit the
variance as resolved into separate components of variance.
The Bernoullian component may be called the random
component, since it arises even when probability is constant,
while the Lexian component may be called the
systematic component, since it arises from the systematic
alteration or variation of probability from one experiment
to another. This resolution of variance into separate
components of variance has been called analysis of variance.
It has been greatly extended by Professor R. A. Fisher, who
has devised regular schemes of experimental arrangement
involving many variates, by means of which not one but
several systematic components of variance can be isolated
(75) from each other and from the random component.

27. Charlier's Criteria of Homogeneity Based on
Dispersion. The test of homogeneity or stability considered
in this section would now be superseded or
amplified by modern methods of analysis of variance,
but it is interesting in itself.

We have approximately, in the Lexian and Coolidge
schemes,

$$\sigma_L^2 = \sigma_B^2 + n^2\sigma_p^2. \tag{1}$$

Hence

$$(\sigma_p/p)^2 = (\sigma_L^2 - \sigma_B^2)/(\mu'_1)^2, \tag{2}$$

where $\mu'_1 = np$, the mean of the distribution.

Hence

$$\sigma_p/p = \sqrt{\sigma_L^2 - \sigma_B^2}\,/\mu'_1. \tag{3}$$

Charlier denoted this by $\rho$, naming it the "coefficient of
perturbation" of a Lexian distribution. He turned it
into a percentage by taking $100\rho$. From (3) we see that
$\rho$ measures the relative fluctuation of probability.

Example. Classing 288,000 Swedish births in 576 sets
of 500 each, according to different months and different
districts, Charlier found for x, the number of male births
in a set,

$m'_1 = 257.12$, $s_L = 12.49$, $n = 500$, $k = 576$.

Hence $p = m'_1/n = 0.514$, $q = 0.486$, not a priori, but as
estimated from the large sample of 288,000; and so

$s_B = \sqrt{npq} = \sqrt{124.9} = 11.18$.

Hence $100\rho = 100\sqrt{156.0 - 124.9}/257 = 2.17$ per cent.

The conclusion made is that a male birth in Sweden is an
event of 51.4 per cent. probability, with a standard deviation
of 51.4 × 0.0217, or about 1.1 per cent. probability.
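Charlier's coefficient of perturbation is a one-line calculation once the mean and the observed standard deviation are known. The sketch below is illustrative; `coefficient_of_perturbation` is a name of our own, and the function uses the approximate relation $\sigma_L^2 = \sigma_B^2 + n^2\sigma_p^2$ stated above, not the exact Lexian formula.

```python
import math


def coefficient_of_perturbation(mean, s_lexian, n):
    """Charlier's rho = sqrt(s_L^2 - s_B^2) / mean, with s_B^2 = n p q and p = mean/n.

    Raises ValueError when the observed variance falls below the Bernoullian
    value, in which case the coefficient is not real.
    """
    p = mean / n
    var_bernoulli = n * p * (1 - p)
    excess = s_lexian ** 2 - var_bernoulli
    if excess < 0:
        raise ValueError("observed variance is below the Bernoullian variance")
    return math.sqrt(excess) / mean


# Charlier's Swedish birth data: 576 sets of 500 births each.
rho = coefficient_of_perturbation(mean=257.12, s_lexian=12.49, n=500)
print(f"100 rho = {100 * rho:.2f} per cent")
print(f"standard deviation of probability = {100 * rho * 257.12 / 500:.1f} per cent")
```

The guard against a negative excess corresponds to the sub-normal (Poissonian) case of 24, where the observed dispersion is less than the Bernoullian.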

28. Types of Multinomial Distribution. The binomial
distribution, of Bernoullian or Poissonian type, is a
special case of the multinomial distribution, the forms of
which are so many and so various as almost to defeat
classification. We have seen a simple example in the
probability distribution of totals of points in n throws of a
die, or a single throw of n similar dice. Here the g.f., for
biassed dice, is

$$(p_1 t + p_2 t^2 + p_3 t^3 + p_4 t^4 + p_5 t^5 + p_6 t^6)^n, \tag{1}$$

and it is best to leave the distribution in this symbolized
form, and not to expand by the multinomial theorem.
The generalization to the case of n different dice, possibly
with different numbers of faces, is easily seen.


Ex. 1. Prove that the mean value of the total in n throws
of a biassed die is $n(p_1 + 2p_2 + 3p_3 + 4p_4 + 5p_5 + 6p_6)$.

Ex. 2. Find, by constructing the f.m.g.f., the variance 
and standard deviation of the total of points in a throw of 
n symmetrical six-sided dice. 

The f.m.g.f. reduces to 


-[('-■2“ + r’ /2l + -)( 1+ ? + 
- + •••)’• 


?*■/*+■••)]’ 


Hence /i 2 = 35/1/12, and so cr = V(35n)/2\/3. 
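
The result of Ex. 2 can be checked by expanding the g.f. numerically, although for hand work the symbolized form remains preferable. The sketch below, in Python with exact fractions, multiplies out the n factors and reads the mean and variance from the coefficients; the function names are illustrative only.

```python
from fractions import Fraction

def total_distribution(face_probs, n):
    """Coefficients of (p1*t + ... + pk*t^k)^n; index = total of points."""
    single = [Fraction(0)] + [Fraction(p) for p in face_probs]  # no term in t^0
    dist = [Fraction(1)]            # g.f. of an empty throw is 1
    for _ in range(n):
        new = [Fraction(0)] * (len(dist) + len(single) - 1)
        for i, a in enumerate(dist):
            for j, b in enumerate(single):
                new[i + j] += a * b
        dist = new
    return dist

def mean_and_variance(dist):
    mean = sum(x * p for x, p in enumerate(dist))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(dist))
    return mean, var

fair = [Fraction(1, 6)] * 6
mean3, var3 = mean_and_variance(total_distribution(fair, 3))
print(mean3, var3)   # to be compared with 7n/2 and 35n/12 for n = 3
```

Passing unequal face probabilities gives the biassed case of Ex. 1 in the same way.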


29. Sampling without Replacement, Hypergeometric
Distribution. When in sampling a population the
individual drawn is not replaced, the result of one drawing
influences the probability of the next, so that the successive
drawings are not independent. Hence it is no longer
possible to combine into a product the g.f.'s of the separate
drawings. It is true that the difficulty can be circumvented
by the introduction of symbolic products, with
due precautions in expansion, but we shall here proceed
from first principles.

Let us consider a population of N individuals, of whom 
M = Np are of character A, so that the probability of 
drawing an A at the first drawing is p. Let n drawings 
be made, no individual drawn being replaced after the 
drawing. It is required to find the probability distribution 
of x , the number of individuals A drawn. 

The probability of x successes $A$, n−x failures $\bar{A}$,
occurring in some particular order, is

$$M(M-1)\dots(M-x+1)(N-M)(N-M-1)\dots(N-M-n+x+1)\,/\,N(N-1)\dots(N-n+1), \qquad (1)$$

as is readily seen by considering how the numbers in
population, and in categories $A$ or $\bar{A}$, are depleted by 1
at each drawing. But there are $n_{(x)}$ possible orders in
which x successes may eventuate among n drawings.
Hence the desired probability is

$$\phi(x) = n_{(x)}M^{(x)}(N-M)^{(n-x)}/N^{(n)}, \qquad (2)$$

where $M^{(x)} = M(M-1)(M-2)\dots(M-x+1)$, and so on.

Just as the binomial probability function of 22 was a
typical term in the binomial expansion of $(pt+q)^n$, so this
function that we have just found is a typical term in a
certain series, a hypergeometric series. Hence $\phi(x)$ is often
called the hypergeometric probability function.

The g.f. is the hypergeometric series

$$\sum_{x=0}^{n} n_{(x)}M^{(x)}(N-M)^{(n-x)}t^x/N^{(n)} \qquad (3)$$

$$= \frac{(N-M)^{(n)}}{N^{(n)}}F(-M, -n;\ N-M-n+1;\ t)$$

in the notation of Gauss, and so the f.m.g.f. is

$$\sum_{x=0}^{n} n_{(x)}M^{(x)}(N-M)^{(n-x)}(1+\alpha)^x/N^{(n)}, \qquad (4)$$

which may be evaluated (by gathering terms together in
Vandermondian expansions) as

$$1 + \frac{Mn}{N}\alpha + \frac{M^{(2)}n^{(2)}}{N^{(2)}}\alpha^2/2! + \frac{M^{(3)}n^{(3)}}{N^{(3)}}\alpha^3/3! + \dots \qquad (5)$$

$$= F(-M, -n;\ -N;\ -\alpha),$$

a terminating hypergeometric series. The mean is thus
$Mn/N$, and the rth factorial moment is $M^{(r)}n^{(r)}/N^{(r)}$.
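
The factorial moments just stated may be verified on a small population. The following Python sketch builds the probability function (2) in exact fractions and compares its factorial moments with $M^{(r)}n^{(r)}/N^{(r)}$; the values N = 20, M = 7, n = 5 are illustrative and are not taken from the text.

```python
from fractions import Fraction
from math import comb

def falling(m, r):
    """Factorial power m^(r) = m(m-1)...(m-r+1)."""
    out = 1
    for i in range(r):
        out *= m - i
    return out

def hypergeometric(N, M, n):
    """phi(x) = n_(x) M^(x) (N-M)^(n-x) / N^(n) for x = 0..n, as exact fractions."""
    return [
        Fraction(comb(n, x) * falling(M, x) * falling(N - M, n - x), falling(N, n))
        for x in range(n + 1)
    ]

def factorial_moment(dist, r):
    return sum(falling(x, r) * p for x, p in enumerate(dist))

N, M, n = 20, 7, 5   # illustrative population, not from the text
phi = hypergeometric(N, M, n)
checks = [
    factorial_moment(phi, r) == Fraction(falling(M, r) * falling(n, r), falling(N, r))
    for r in range(1, 4)
]
print(sum(phi), factorial_moment(phi, 1), checks)
```

Exact fractions are used so that agreement is an identity and not a matter of rounding.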

The examples which have now been given of probability 
distributions have shown how numerous and varied are 
the types of distribution. In fact, any proposed probability 
function may be simulated by a suitably constructed model 
or population, and special samplings of this population
give rise to further probability functions. Fortunately,
when the number n of trials is large, many of these
probability distributions tend with good approximation
towards one or other of a few dominant types, which we 
shall now consider. 

30. Important Approximate Distributions: Types
A and B. When the coefficients of $t^x$ in the Bernoullian
binomial g.f. $(pt+q)^n$ are taken as probability ordinates
$y = \phi(x)$, we may join the tops of the ordinates to form
a probability polygon. If this is done for increasing
values of n, the mean np being taken as origin and the
standard deviation $\sqrt{npq}$ as unit of scale, it is found
that the successive probability polygons tend to lose any
initial asymmetry due to inequality of p and q.

In fact the coefficient $\beta_1$ of skewness is

$$\beta_1 = \mu_3^2/\mu_2^3 = [npq(q-p)]^2/(npq)^3 = (q-p)^2/npq, \qquad (1)$$

which evidently tends to zero as n increases, unless either
of p or q is of the order of magnitude of 1/n, let us say
O(1/n), in which case the skewness remains appreciable.
Not only so but, apart from the exception just mentioned, 
these binomial curves are found to cluster towards a 
limiting symmetrical shape, the same for all. The curve 
to which they thus approach asymptotically is of paramount 
importance in statistics, and is called the normal probability 
curve. It is the asymptotic shape not merely of the 
Bernoullian binomial but of the Poissonian, as well as of 
the multinomial and of many other distributions, and it is 
characterized by the probability differential

$$dp = \phi(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(x-\mu)^2/\sigma^2}dx, \qquad (2)$$

where $\mu$ is the mean, $\sigma$ the standard deviation.

When small corrective terms involving n are retained,
a closer representation is given by the probability function
of Type A, namely

$$p(x) = \phi(x) - a_3\phi'''(x)/3! + a_4\phi^{iv}(x)/4! - \dots, \qquad (3)$$

where $\phi(x)$ denotes the normal probability function, and
the coefficients $a_r$ of the derivatives $\phi^{(r)}(x)/r!$ are of
irregularly decreasing orders of magnitude with respect to
n. The coefficient $a_3$, when freed of arbitrary units,
measures skewness, $a_4$ measures excess.

As noted above, the case when p is very small is
exceptional. If p is O(1/n), the mean np is not O(n)
but O(1). In this case the normal function is not the
most suitable basis of approximation, and the appropriate
asymptotic probability function is Poisson's function of
statistical rareness, namely

$$\psi(x) = e^{-\mu}\mu^x/x!, \qquad (4)$$

where $\mu$ is the mean. Here again, when terms of smaller
order involving n are retained, a closer representation is
given by the probability function of Type B, namely

$$p(x) = \psi(x) + b_2\nabla^2\psi(x)/2! - b_3\nabla^3\psi(x)/3! + \dots, \qquad (5)$$

where $\psi(x)$ is Poisson's function (4) above, and $\nabla$ denotes
the operation of forming the receding difference, so that
$\nabla\psi(x) = \psi(x) - \psi(x-1)$. It proves to be the case that
$b_2$ is $O(n^{-1})$, $b_3$ and $b_4$ are $O(n^{-2})$, $b_5$ and $b_6$ are $O(n^{-3})$,
and so on.

We now consider the derivation of these functions. 

31. The Normal Function as Limit of the Binomial.
The rigorous derivation of the normal function as generated
by compounding n independent distributions, and the
discussion of necessary and sufficient conditions, require
advanced mathematics beyond our scope. We content
ourselves here with elementary and incomplete treatments.

Consider first the binomial g.f. $(pt+q)^n$, where p is
not of order 1/n, but is O(1). Putting $t = e^\alpha$ we have
the m.g.f.

$$(pe^\alpha + q)^n = (1 + p\alpha + p\alpha^2/2! + p\alpha^3/3! + \dots)^n. \qquad (1)$$

The mean is np. Let us transfer to the mean, and to
discover the limiting shape of the curve of probability
let us alter the scale, so as to find the distribution, not of
actual number of successes x, but of the deviation
$(x - np)/n$ of the relative frequency of successes from the
mean p of relative frequency.

As a first step we construct the m.g.f. of $x/n$. By
11 it is

$$[1 + p\alpha/n + p\alpha^2/2!\,n^2 + O(n^{-3})]^n$$
$$= [(1 + p\alpha/n + \tfrac12 p^2\alpha^2/n^2)(1 + \tfrac12(p-p^2)\alpha^2/n^2 + O(n^{-3}))]^n, \qquad (2)$$

where $O(n^{-3})$ indicates in both cases remainder terms of
order $n^{-3}$. As n increases this m.g.f. tends asymptotically to

$$e^{p\alpha}e^{\frac12 pq\alpha^2/n}. \qquad (3)$$

The first factor shows that the mean of the transformed
variate is p; but this we already know. The second
factor indicates, by a further obvious transformation of
scale, that the m.g.f. of the standardized deviation
$z = (x - np)/\sqrt{npq}$ is

$$e^{\frac12\alpha^2}. \qquad (4)$$

Now the possible number x of successes may range from
0 to n. Thus the values of z may range from $-\sqrt{np/q}$
to $+\sqrt{nq/p}$, a range which tends in both directions to
infinity. Further, consecutive values of x differ by 1,
and so consecutive values of z differ by $1/\sqrt{npq}$, an
interval which tends to zero as n increases. We therefore
seek a representation of the probability function $\phi(z)$ as
a positive function continuous over the range $-\infty$ to $\infty$;
and the question is, what function $\phi(z)$ is such that its
m.g.f.

$$\int_{-\infty}^{\infty}\phi(z)e^{\alpha z}dz = e^{\frac12\alpha^2}\ ? \qquad (5)$$

The answer is contained in a theorem, to the effect that
the only positive continuous function satisfying this
relation for some continuous range of values of $\alpha$ is

$$\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac12 z^2}, \qquad (6)$$

and this is the normal probability function, in standard
form.

The reader should become thoroughly familiar both with
this form and with the unstandardized form of 30 (2).

Incidentally, taking the logarithm of the m.g.f. (4),
we see that apart from the mean or first cumulant there
is only one other cumulant, namely $\kappa_2$ or $\sigma^2$.
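
The approach of the binomial ordinates to the normal curve may be observed numerically. The sketch below, in Python, multiplies each binomial probability by $\sqrt{npq}$ (the reciprocal of the interval between consecutive values of z) and records the largest departure from $\phi(z)$; the choice p = 0.3 and the values of n are illustrative assumptions.

```python
import math

def binomial_pmf(n, p, x):
    """Exact binomial probability via log-gamma to avoid overflow."""
    log_c = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_c + x * math.log(p) + (n - x) * math.log(1 - p))

def normal_density(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def max_discrepancy(n, p):
    """Largest gap between sd * P(x) and phi(z) over x = 0..n."""
    sd = math.sqrt(n * p * (1 - p))
    worst = 0.0
    for x in range(n + 1):
        z = (x - n * p) / sd
        worst = max(worst, abs(sd * binomial_pmf(n, p, x) - normal_density(z)))
    return worst

for n in (25, 100, 400, 1600):
    print(n, round(max_discrepancy(n, 0.3), 5))
```

The discrepancy should be seen to diminish roughly as $n^{-1/2}$, which is the order of the skewness term retained in Type A.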


32. Properties of the Normal Probability Function.
The curve of the normal function is a symmetrical
bell-shaped curve, extending to infinity on either side and
flattening rapidly upon the axis of x.

The maximum ordinate is $y_0 = 1/\sqrt{2\pi}$. The area
under the curve is

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-\frac12 z^2}dz = 1, \qquad (1)$$

by the well-known integral. (Gillespie, Integration, p. 88.)
The points of inflexion, given by $d^2y/dz^2 = 0$, will be
found to be at $z = \pm1$, or, in unstandardized units, at
deviations $\pm\sigma$ from the centre.

The probability, as taken from the normal curve, that
a deviation from the mean is numerically less than z is
the area under the curve between the ordinates for −z
and +z, namely

$$\frac{1}{\sqrt{2\pi}}\int_{-z}^{z}e^{-\frac12 z^2}dz. \qquad (2)$$

This function, called the error function or probability
integral, is denoted by erf(z) and has been extensively
tabulated. (It is called the error function because the typical
distribution of errors committed by instruments of
observation has been found to be sensibly normal.) The following
short table shows how the probability of deviations outside
the range (−z, z) diminishes as z increases:

z        0    0.5     1.0     1.5     2.0     2.5     3.0     3.5      4.0
erf(z)   0    0.383   0.683   0.866   0.954   0.988   0.997   0.9995   0.99994

We may note that the probability of a deviation
greater than $\sigma$ is about 1/3 or more nearly 7/22; that
of one greater than $2\sigma$ is about 1/20 or more nearly 1/22;
that of one greater than $3\sigma$ is about 1/370; and that of
one greater than $4\sigma$ is about 1/17000.
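
The short table can be reproduced without recourse to printed tables. A caution on notation is needed: the function called erf(z) in this section is the probability of a deviation numerically less than z, which in terms of the error function of most modern libraries is erf(z/√2). The Python sketch below makes that conversion explicit; the helper name `prob_within` is an assumption made for the illustration.

```python
import math

def prob_within(z):
    """Probability that a standardized normal deviation is numerically less than z.

    This is the text's erf(z); in terms of the library function math.erf
    it is erf(z / sqrt(2)).
    """
    return math.erf(z / math.sqrt(2))

for z in (0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0):
    inside = prob_within(z)
    print(f"z = {z:3.1f}   inside = {inside:.5f}   outside = {1 - inside:.5f}")
```

The reciprocals of the printed "outside" values give the rough odds (1/3, 1/22, 1/370 and so on) noted above.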

The quartile deviation or so-called “probable error”
is given by

$$\text{erf}(z) = \tfrac12. \qquad (3)$$

By interpolation it is found to be z = 0.6745 nearly,
corresponding to a deviation from the mean of about 2/3
or more nearly 27/40 of the standard deviation.

The mean absolute deviation is given by

$$\frac{2}{\sqrt{2\pi}}\int_0^\infty ze^{-\frac12 z^2}dz = \sqrt{2/\pi} = 0.7979 \text{ nearly}, \qquad (4)$$

corresponding to about 4/5 of the standard deviation.
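
Both constants are easily confirmed by machine. In the Python sketch below the quartile deviation is obtained from the inverse of the cumulative normal function (the library class `statistics.NormalDist`), and the mean absolute deviation from a simple midpoint-rule quadrature of (4); the step and upper limit of the quadrature are arbitrary choices.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quartile deviation: the z for which half the probability lies within (-z, z),
# that is, the upper quartile of the standardized curve.
probable_error = std_normal.inv_cdf(0.75)

# Mean absolute deviation, by a midpoint rule on (2/sqrt(2*pi)) * z * exp(-z^2/2).
step = 0.001
mean_abs_dev = sum(
    2 * std_normal.pdf(z) * z * step
    for z in (step * (i + 0.5) for i in range(10000))  # integrate from 0 to 10
)

print(round(probable_error, 4), round(mean_abs_dev, 4), round(math.sqrt(2 / math.pi), 4))
```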

The higher moments of the normal function are found
by expanding the m.g.f. $\exp(\tfrac12\alpha^2)$, or the unstandardized
$\exp(\tfrac12\sigma^2\alpha^2)$, and observing the coefficients of $\alpha^r/r!$. For
odd orders they vanish, for even orders 2r they are given by

$$\mu_{2r} = (\tfrac12)^r\sigma^{2r}(2r)!/r!. \qquad (5)$$

In particular $\mu_4 = 3\sigma^4$, so that (17) the coefficient of
excess $\beta_2 = \mu_4/\mu_2^2 = 3$.


33. Poissonian Function of Rare Statistical
Frequency. We return to the binomial g.f. $(pt+q)^n$,
examining the previously excepted case in which, though
n becomes large, p is so small that the mean np is O(1);
in fact $p = O(n^{-1})$. Writing the mean np as $\mu$, we have
$p = \mu/n$. The f.m.g.f. is therefore (22)

$$(1 + \mu\alpha/n)^n, \text{ which tends to } e^{\mu\alpha} \qquad (1)$$

as n increases. This is the f.m.g.f. of the Poissonian
function. The probability g.f. is therefore

$$e^{\mu(t-1)}, \qquad (2)$$

and the coefficient of $t^x$ in this gives the desired probability
function as

$$\psi(x) = e^{-\mu}\mu^x/x!. \qquad (3)$$
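
The passage to the limit may be watched numerically by holding the mean fixed and letting n increase. The Python sketch below compares the binomial probabilities for p = μ/n with Poisson's function; the mean μ = 2 and the range of x examined are illustrative assumptions.

```python
import math

def binomial_pmf(n, p, x):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(mu, x):
    return math.exp(-mu) * mu ** x / math.factorial(x)

mu = 2.0  # illustrative mean, not taken from the text
for n in (10, 100, 1000, 10000):
    gap = max(abs(binomial_pmf(n, mu / n, x) - poisson_pmf(mu, x)) for x in range(11))
    print(n, f"{gap:.6f}")
```

The largest difference falls off roughly in proportion to 1/n, in keeping with the order of the first corrective coefficient of Type B.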

34. Properties of the Poissonian Function. The
normal function contains two parameters, the mean $\mu$
and the standard deviation $\sigma$. The Poissonian function
has one parameter only, the mean $\mu$. The range of the
function is from x = 0 to x = ∞. For $\mu < 1$ the
probability polygon is J-shaped, for $\mu > 1$ it becomes
double-sided and for large values of $\mu$ tends to acquire symmetry.
Indeed, for large values of $\mu$ the shape is approximately
normal; for the ordinary m.g.f. is

$$\exp[\mu(e^\alpha - 1)] = \exp(\mu\alpha + \mu\alpha^2/2! + \mu\alpha^3/3! + \dots) \qquad (1)$$

and if we change the scale so as to make $\sqrt{\mu}$ the unit we
obtain the m.g.f.

$$\exp(\mu^{\frac12}\alpha + \alpha^2/2! + \alpha^3/3!\,\mu^{\frac12} + \dots), \qquad (2)$$

which, to a first approximation (that is, including the
first two terms of the series in the bracket) is the m.g.f.
of a normal function with mean $\mu^{\frac12}$ and unit standard
deviation.

The logarithm of (1) gives the cumulant g.f. of the
Poissonian function as $\mu(e^\alpha - 1)$, which shows that all the
seminvariants $\kappa_r$ are equal to the mean $\mu$; in particular
the variance $\kappa_2$ or $\mu_2$ is equal to $\mu$.

There is only one factorial cumulant, $\kappa_{(1)} = \mu$.

35. More General Derivations; Types A and B.
As before, the extensions of the domain of application of
the fundamental distributions given below are not
established under the widest conditions.

Let us consider the compounding of n systems $S_j$,
where j = 1, 2, ..., n, where each system has finite
cumulants, all O(1). The cumulant g.f. of $S_j$ is then a convergent
series

$$K^{(j)}(\alpha) = \kappa_1^{(j)}\alpha + \kappa_2^{(j)}\alpha^2/2! + \kappa_3^{(j)}\alpha^3/3! + \dots \qquad (1)$$

For example, the binomial distribution of n throws of a
coin, provided that p is O(1) and not $O(n^{-1})$, may be proved
to have a c.g.f. of this kind.

Now imagine all the n systems $S_j$ to operate
independently, the results being added to make a variate x. By
the additive property of cumulants the cumulant g.f.
of x is

$$\sum_j(\kappa_1^{(j)}\alpha + \kappa_2^{(j)}\alpha^2/2! + \kappa_3^{(j)}\alpha^3/3! + \dots), \qquad (2)$$

and the c.g.f. about the mean of x is the same with the
term in $\alpha$ removed. The second cumulant of x is clearly
O(n), and so the standard deviation is $O(n^{\frac12})$. Let us
therefore alter the scale so that $x/\sqrt{n}$ becomes the variate.
The c.g.f. of this variate is then

$$\kappa_1\alpha + \kappa_2\alpha^2/2! + \kappa_3\alpha^3/3! + \dots, \qquad (3)$$

where $\kappa_1$ is $O(n^{\frac12})$, $\kappa_2$ is O(1), $\kappa_3$ is $O(n^{-\frac12})$ and in general
$\kappa_r$ is $O(n^{1-\frac12 r})$. Again, the c.g.f. about the mean is the
same with the term in $\alpha$ removed.

Thus as n increases the dominant term in the c.g.f.
about the mean is $\kappa_2\alpha^2/2!$, which is the c.g.f. of a normal
function

$$\phi(z) = \frac{1}{\sqrt{2\pi\kappa_2}}e^{-\frac12 z^2/\kappa_2}. \qquad (4)$$

If, however, we retain the terms of smaller order, while
choosing the scale so that $\kappa_2 = 1$, the m.g.f. about the
mean is

$$M(\alpha) = \exp(\tfrac12\alpha^2)\exp(\kappa_3\kappa_2^{-3/2}\alpha^3/3! + \kappa_4\kappa_2^{-2}\alpha^4/4! + \dots)$$
$$= \exp(\tfrac12\alpha^2)(1 + a_3\alpha^3/3! + a_4\alpha^4/4! + \dots), \qquad (5)$$

where the second factor in brackets on the right arises
from the expansion of the second exponential in the first
line. Now if a probability function p(x), which vanishes
with all its derivatives at the boundaries, has m.g.f.

$$M(\alpha) = \int_a^b p(x)e^{\alpha x}dx, \qquad (6)$$

it may be proved by r integrations by parts that

$$\alpha^rM(\alpha) = (-)^r\int_a^b\frac{d^rp(x)}{dx^r}e^{\alpha x}dx. \qquad (7)$$

Thus here, reverting the m.g.f. of (5) term by term, we
derive the corresponding probability function as

$$p(z) = \phi(z) - a_3\phi'''(z)/3! + a_4\phi^{iv}(z)/4! - \dots \qquad (8)$$

provided that the series for m.g.f. and probability function
are convergent. This is the probability function of Type A.

A close examination of the magnitude of terms in the
expansion of

$$\exp(\kappa_3\kappa_2^{-3/2}\alpha^3/3! + \kappa_4\kappa_2^{-2}\alpha^4/4! + \dots) \qquad (9)$$

shows that the order of magnitude of coefficients in the
series of Type A is as follows:

$a_3 = O(n^{-\frac12})$, $a_4$ and $a_6 = O(n^{-1})$, $a_5$, $a_7$ and $a_9 = O(n^{-\frac32})$,

and later coefficients show a similar irregularity.

Here let us pause to point out a practical disadvantage
of the representation by Type A. If we are representing
a given frequency distribution by Type A, we must use
the observed moments to estimate the coefficients $a_3$, $a_4$, ...
in Type A. Let us suppose that the convergence demands
the retention of terms up to $O(n^{-1})$. We must then include
not only $a_4$ but also $a_6$. Now $a_6$ depends on the 6th
moment, and the 6th moment of the observations is subject
to very high sampling error (68). Hence the effort to
increase mathematical accuracy by retention of higher
terms is largely frustrated by the statistical inaccuracy
of the moments used to estimate those terms.

Series of Type B. The procedure for deriving the
function of Type B is rather similar. The f.m.g.f. proves
to be

$$\exp(\mu\alpha)(1 + b_2\alpha^2/2! + b_3\alpha^3/3! + \dots), \qquad (10)$$

which on reversion term by term gives

$$p(x) = \psi(x) + b_2\nabla^2\psi(x)/2! - b_3\nabla^3\psi(x)/3! + \dots, \qquad (11)$$

the series of Type B, where $\psi(x)$ denotes Poisson's function
of 33 (3). Here the order of magnitude of coefficients is
found to be:

$b_2 = O(n^{-1})$, $b_3$ and $b_4 = O(n^{-2})$, $b_5$ and $b_6 = O(n^{-3})$,

and so on. Thus in using the function of Type B for the
representation of a frequency distribution it is best to
truncate the series after a difference of even order.


36. Other Systems of Probability Functions: the
System of Pearson. We have seen how the functions of
Types A and B arise by the addition of cumulant (or
factorial cumulant) generating functions, corresponding
to the compounding of values of an additive variate.
But a variate of this kind is a very special one. For
example, if x is built up of added increments, then $x^2$,
which we might have occasion to use instead of x, is
certainly not the sum of the squares of those increments.
Indeed, as we may well anticipate, the distribution of $x^2$
is different from that of x.

For this and for other reasons the scope of typical
probability functions has been widened, and systems
other than Type A and Type B have found acceptance.
One such system is the system introduced in 1895 by
Karl Pearson.

Let us consider the difference or differential equations
satisfied by some of the standard probability functions.
We shall use the receding difference operation defined by

$$\nabla\phi(x) = \phi(x) - \phi(x-1).$$

(i) The binomial probability function of 22 (1) satisfies

$$\nabla\phi(x) = -\frac{x-(n+1)p}{p(n-x+1)}\phi(x). \qquad (1)$$

(ii) The Poissonian function $\psi(x)$ of 30 (4) satisfies

$$\nabla\psi(x) = -\frac{x-\mu}{\mu}\psi(x). \qquad (2)$$

(iii) The hypergeometric probability function of 29 (2)
satisfies

$$\nabla\phi(x) = -\frac{x(N+2)-(M+1)(n+1)}{(M-x+1)(n-x+1)}\phi(x). \qquad (3)$$

(iv) The normal probability function in standard form
31 (6) satisfies

$$\frac{d\phi}{dx} = -x\phi(x). \qquad (4)$$

A number of other probability functions, arising
naturally in problems of repeated trials, might be added
to this list. The Pearsonian system consists of the
functions $\phi(x)$ which satisfy the differential equation

$$\frac{d\phi}{dx} = -\frac{(x-a)\phi}{c_0+c_1x+c_2x^2}. \qquad (5)$$

The functions are found by immediate integration;
thus

$$\log y = -\int\frac{(x-a)\,dx}{c_0+c_1x+c_2x^2},$$

whence y can be found by the methods of elementary
integral calculus. The quadratic in the denominator of
the integrand may have real, variously positive or negative,
or equal, or numerically equal but of opposite sign, or
complex roots; or again, with $c_2 = 0$, may degenerate
into a linear function, or with $c_1$ and $c_2 = 0$ into a constant.
These various cases yield the Pearsonian curves, usually
classified into twelve types; while the discriminant of
the quadratic, expressed in terms of moments of the
curves, yields a “criterion” for judging in advance what
type is appropriate to a proposed frequency distribution.

A full account of the curves, their shape and the process
of representing frequency data by them is given in
Elderton's Frequency Curves and Correlation (3rd edition,
London, 1938), to which we refer the reader for details.
Here we have space to mention from time to time only
a few of the curves, as they occur in special problems.

37. Probability Functions Generated by Change
of Variate. If x is distributed about the mean x = 0
in a normal distribution

$$dp = (2\pi)^{-\frac12}e^{-\frac12x^2}dx, \qquad (1)$$

it is certainly not the case that $x^2$ is normally distributed;
for putting $z = \tfrac12x^2$, we have $dx = (2z)^{-\frac12}dz$, and so,
positive and negative values of x contributing equally,

$$dp = \pi^{-\frac12}z^{-\frac12}e^{-z}dz. \qquad (2)$$

The range of z is from 0 to ∞, and the constant $\pi^{-\frac12}$ is
such that the integral of the probability function of z
over this range is 1. The distribution of z is skew, and
is actually a case of Pearson's Type III.

Ex. 1. Prove that the m.g.f. of z is $(1-\alpha)^{-\frac12}$.
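
Before attempting the proof the reader may care to confirm the result numerically. The Python sketch below evaluates the mean value of $e^{\alpha z}$, with $z = \tfrac12x^2$, by integrating over the normal distribution of x, which avoids the infinite ordinate of (2) at z = 0. The limits and number of steps of the quadrature are arbitrary assumptions, adequate only when α is not close to 1.

```python
import math

def mgf_half_square(alpha, half_width=12.0, steps=48000):
    """Approximate E[exp(alpha * x^2 / 2)] for a standard normal x (needs alpha < 1).

    Midpoint rule over (-half_width, half_width); the tails beyond are negligible
    provided alpha is not too close to 1.
    """
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += math.exp(-0.5 * (1 - alpha) * x * x)
    return total * h / math.sqrt(2 * math.pi)

for alpha in (-1.0, 0.0, 0.5):
    print(alpha, round(mgf_half_square(alpha), 6), round((1 - alpha) ** -0.5, 6))
```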

Again, if x is distributed between $-\tfrac12$ and $\tfrac12$ in the
rectangular distribution dp = dx, the cube root $z = x^{\frac13}$
is distributed, as the reader should verify, in the U-shaped
distribution $dp = 3z^2dz$. Or again, to take an example
from physics, if the distribution of the velocities of a
great number of particles about a zero mean velocity
were normal, the distribution of their energies would be
of Type III.

The derivation of probability functions from the 
normal function by non-linear change of variate was 
emphasized by J. C. Kapteyn in 1903 (Skew Curves in 
Biology and Statistics , Groningen), but was by no means 
a new conception even at that time. 

Ex. 2. If x is a normal variate in standard measure, we
have seen in Ex. 1 that the m.g.f. of $z = x^2$ is $(1-2\alpha)^{-\frac12}$.
Hence the m.g.f. of $x_1^2 + x_2^2 + \dots + x_n^2$, where the $x_i$ are
independent normal variates with the same mean x = 0 and
in standard measure, is $(1-2\alpha)^{-\frac12n}$. The probability function
which has this m.g.f. is unique, and of the form $cz^{\frac12(n-2)}e^{-\frac12z}$.
The reader should verify that this function actually has the
above m.g.f., and should find by integration the value of c.

Ex. 3. If x is distributed normally about x = 0 as mean,
find the distributions of: (i) z = e*, (ii) z = x* 9 (iii) z = x L 


38. Cauchy's Probability Function. The probability
function which we shall next consider arises by
change of variate in a rectangular distribution. Let us
take a point Q on the axis of y at unit distance from the
origin O. Let a straight line be taken at angle $\theta$ to QO,
all values of $\theta$ from $-\tfrac12\pi$ to $\tfrac12\pi$ being equally likely, to
cut the x axis in the point X = (x, 0). What is the
probability distribution of x?

The distribution of $\theta$ is rectangular, $dp = \pi^{-1}d\theta$.
Also $x = \tan\theta$, so that $\theta = \arctan x$, $d\theta = dx/(1+x^2)$.
Hence the distribution of x is given by

$$dp = \frac{1}{\pi}\frac{dx}{1+x^2}, \quad \text{range } -\infty \text{ to } \infty.$$


The probability function appearing here is Cauchy's
probability function. It has the property (very awkward
for any theory of estimation from sample based on
moments) that its moments of even order $\mu_2$, $\mu_4$, ... are
all infinite. The reader should verify this by integration.
It follows at once that linear compounding of independent
variates obeying laws of Cauchy type cannot be carried
out by the addition of cumulants; in fact the cumulant
g.f.'s do not converge. This exception to the common
rule gives us a salutary reminder that linear compounding
of independent variates does not necessarily generate a
distribution of normal type.

The Cauchy curve has been found to possess a specially
remarkable property. If n independent variates obeying
the same Cauchy law are added, and the mean is taken,
this mean obeys exactly the same law. Not only so, but
the distribution of any linear combination

$$z = c_1x_1 + c_2x_2 + \dots + c_nx_n$$

of variates $x_i$ obeying the same Cauchy law, where the
$c_j$ are positive and sum to 1, is again exactly the same
Cauchy distribution.
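
The reproductive property can be illustrated, though of course not proved, by artificial sampling. The Python sketch below draws $\theta$ uniformly, forms $x = \tan\theta$ as in the geometrical construction, and compares a single variate with the mean of ten; since the quartiles of the Cauchy law are at $x = \pm1$, the proportion of values numerically less than 1 should be close to one half in both cases. The seed, the number of trials and the choice of ten variates are arbitrary assumptions.

```python
import math
import random

def cauchy_variate(rng):
    """Draw theta uniformly on (-pi/2, pi/2) and return x = tan(theta)."""
    theta = (rng.random() - 0.5) * math.pi
    return math.tan(theta)

def fraction_within_unit(draw, trials):
    """Share of draws with |x| < 1; for the Cauchy law this is exactly 1/2."""
    return sum(abs(draw()) < 1 for _ in range(trials)) / trials

rng = random.Random(12345)  # fixed seed so the run is repeatable
single = fraction_within_unit(lambda: cauchy_variate(rng), 20000)
mean_of_10 = fraction_within_unit(
    lambda: sum(cauchy_variate(rng) for _ in range(10)) / 10, 20000
)
print(single, mean_of_10)
```

Had the variates been normal, the mean of ten would have been concentrated about zero and the second proportion would have been very nearly 1.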



The figure shows the normal curve in standard measure 
and the flatter Cauchy curve drawn to the same scale. 

39. The Pearson Curve of Type I. As a final
example of a probability function arising from a particular
problem, let us consider the following:

Suppose that x is distributed in the rectangular
distribution over the range 0 to 1. Let n+1 points $x_i$
be taken independently in this range. What is the
probability that the (k+1)th point of these, as counted
from the left of the range, is in the elementary interval
$x - \tfrac12dx$ to $x + \tfrac12dx$?

The probability is compound; it is the probability
that one, any one, of the n+1 points is in the interval,
and that k of the remaining n are in the range 0 to $x - \tfrac12dx$
while n−k are in the range $x + \tfrac12dx$ to 1. Hence the
compound probability is

$$dp = \phi(x)dx = n_{(k)}(n+1)x^k(1-x)^{n-k}dx, \qquad (1)$$

for the first probability mentioned is (n+1)dx and the
second is $n_{(k)}x^k(1-x)^{n-k}$. The probability function $\phi(x)$
obtained here is of Pearson's Type I. It is in fact the
integrand of the Beta function (Gillespie, Integration, p. 84),
apart from the factor $n_{(k)}(n+1)$ which ensures that the
area under the curve is 1. Had the range been a to b,
we should have obtained

$$\phi(x) = n_{(k)}(n+1)(x-a)^k(b-x)^{n-k}/(b-a)^{n+1}. \qquad (2)$$

The probability integral of the simpler form (1) over 
the partial range (0, z) is called the Incomplete Beta 
function. In the same way the integral over (0, x) of 
the function 

■ ■ ■ m 

which is a case of Pearson’s Type III, is called the 
Incomplete Gamma function. 
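
That the factor $n_{(k)}(n+1)$ in 39 (1) does make the total probability unity may be checked exactly, since for whole-number indices the complete Beta integral is a ratio of factorials. The Python sketch below does this and also finds the mean position of the (k+1)th point; the values n = 9, k = 3 are purely illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact value of the integral of x^a (1-x)^b over (0, 1) for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def order_statistic_checks(n, k):
    """Total probability and mean of phi(x) = n_(k) (n+1) x^k (1-x)^(n-k)."""
    const = comb(n, k) * (n + 1)
    total = const * beta_integral(k, n - k)
    mean = const * beta_integral(k + 1, n - k)
    return total, mean

total, mean = order_statistic_checks(n=9, k=3)  # illustrative values only
print(total, mean)
```

The mean comes out as (k+1)/(n+2), so that on the average the n+1 points divide the range into n+2 equal parts.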

Variety of Probability Curves. The preceding
survey of types of probability function, though far from
exhaustive, will have served to dispel the idea, once
rather prevalent, that normality and symmetry were the
rule and that skewness was an accident of sampling. The
rôle of the normal distribution in statistics is not unlike
that of the straight line in geometry; and we do not
force curves into the mould of the straight line. Skew
distributions are in fact the predominant type, for
skewness arises from Lexian variability or non-homogeneity,
from Poissonian statistical rarity, from limitation in the
number of causes of variation, and from non-linear
transformations of the scale.

Readers of fairly advanced mathematical attainments, and
interested in the rigorous derivation of the normal distribution
and its corrective terms, will find great profit in studying the
Cambridge Tract No. 36, Random Variables and Probability
Distributions (1937), by Harald Cramér.

PRACTICAL CURVE-FITTING WITH
STANDARD CURVES

40. Representation of Frequency Data by Normal
Curve. The present chapter will be devoted to the
numerical details of representing frequency distributions
by normal curves, curves of Type A, Poissonian curves
and curves of Type B.

In fitting the normal curve, that is, in finding the
equation of the normal function of best approximation
to the given frequency distribution, the idea is to represent
the relative class frequencies by the corresponding segments
of area under the normal curve between neighbouring
ordinates corresponding to consecutive class boundaries.
The mean $m_1'$ or m of the frequency distribution is taken
as the estimate of the mean $\mu_1'$ or $\mu$ of the normal function;
the second moment $m_2$ or $s^2$, corrected for grouping if
necessary by Sheppard's correction, is taken as the
estimate of the corresponding $\mu_2$ or $\sigma^2$. In order to use
the standardized tables of the normal probability integral
it is best, once $m_1'$ and $m_2$ have been computed, to
standardize the class boundaries, taking them as deviations
from the mean, in units of s. The values of the probability
integral corresponding to these class boundaries are then
read from tables (Appendix 4); the first differences of these
values are the estimates of the class probabilities; and
finally we may multiply by n, the total number in sample,
to make comparison with the absolute class frequencies.
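
The routine just described lends itself to mechanical computation. The following Python sketch standardizes the class boundaries, takes first differences of the cumulative normal function and multiplies by the number in sample. It uses the mean, standard deviation and total quoted in the example of heights which follows; the class boundaries 59.5, 60.5, ..., 72.5 are an assumption inferred from the one-inch classes, and the function name is introduced only for illustration.

```python
from statistics import NormalDist

def fitted_frequencies(boundaries, mean, sd, total):
    """Expected class frequencies between consecutive class boundaries.

    The lowest class is extended to minus infinity and the highest to plus
    infinity, so that the fitted frequencies add up to `total`.
    """
    curve = NormalDist(mean, sd)
    cdf = [0.0] + [curve.cdf(b) for b in boundaries] + [1.0]
    return [total * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

# Values quoted in the example of heights: mean 67.34, s = 2.15, 346 in sample;
# the boundaries 59.5, 60.5, ..., 72.5 are assumed from the one-inch classes.
boundaries = [59.5 + i for i in range(14)]
fitted = fitted_frequencies(boundaries, 67.34, 2.15, 346)
for upper, f in zip(boundaries, fitted):
    print(f"up to {upper:4.1f}: {f:6.1f}")
```

The cumulative function used here exceeds the tabulated ½ erf z by the constant ½, which disappears on differencing.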

Example. In the data of heights of Irishmen (18, Ex.)
the mean is 67.34, and $m_2$ with Sheppard's correction is
4.705 − 0.083 = 4.622. Hence s = 2.15, 1/s = 0.465. The
standardized deviations of class boundaries are shown in the
column $z = (x - m \pm \tfrac12)/s$ below. Since their common
difference is 1/s or 0.465, they are readily found, when once any
one of them has been computed, by repeated addition or
subtraction of 0.465, and the results can be checked at the
ends of the range. The next column shows the values of
½ erf(z), the next the first differences of these, the next the
same multiplied by 346, and the final column the original
class frequencies themselves for comparison.


  x     z = (x−m±½)/s    ½ erf z    ½Δ erf z    ½nΔ erf z    obs.

            -∞           -0.5000
 59       -3.646         -0.4999     0.0001         0          1
 60       -3.181         -0.4993     0.0006         0          0
 61       -2.716         -0.4967     0.0026         1          2
 62       -2.251         -0.4878     0.0089         3          2
 63       -1.786         -0.4629     0.0249         9          7
 64       -1.321         -0.4068     0.0561        19         15
 65       -0.856         -0.3040     0.1028        36         33
 66       -0.391         -0.1521     0.1519        53         58
 67        0.074          0.0295     0.1816        63         73
 68        0.539          0.2051     0.1756        61         62
 69        1.004          0.3423     0.1372        47         40
 70        1.469          0.4291     0.0868        30         25
 71        1.934          0.4734     0.0443        15         15
 72        2.399          0.4918     0.0184         6         10
 73

(Each difference is entered against the class whose upper
boundary is z.)

41. Representation by Type A. The coefficients
$a_3$, $a_4$, ... in the series 35 (8) of Type A can be expressed
in terms of the moments about the mean. For by 35 (5)
the m.g.f. (in unstandardized scale) is given by

$$1 + \mu_2\alpha^2/2! + \mu_3\alpha^3/3! + \mu_4\alpha^4/4! + \dots$$
$$= \exp(\tfrac12\sigma^2\alpha^2)(1 + a_3\sigma^3\alpha^3/3! + a_4\sigma^4\alpha^4/4! + \dots). \qquad (1)$$

Multiply each of these expressions by $\exp(-\tfrac12\sigma^2\alpha^2)$
and expand the product in the former case. Equating
coefficients of $\alpha^r/r!$, we have the desired relations

$$a_3 = \mu_3/\sigma^3,$$
$$a_4 = (\mu_4 - 3\mu_2^2)/\sigma^4,$$
$$a_5 = (\mu_5 - 10\mu_2\mu_3)/\sigma^5,$$
$$a_6 = (\mu_6 - 15\mu_2\mu_4 + 30\mu_2^3)/\sigma^6, \qquad (2)$$

and so on.

The routine for fitting Type A is a slight extension
of that used in fitting the normal curve. Moments about
the mean are computed and if necessary corrected by
Sheppard's corrections. The coefficients $a_3$, $a_4$, ... are
estimated from these moments by the formulae just given,
with $m_r$ substituted for $\mu_r$. The integral of the
corresponding Type A series is then taken instead of the normal
probability integral. This involves the necessity, if terms
in $a_3$ and $a_4$ are included, of having supplementary tables
of the integrals of the functions which appear in these
terms, that is, tables of

$$\int_0^z\phi'''(z)\,dz/3! \quad \text{and} \quad \int_0^z\phi^{iv}(z)\,dz/4!.$$

Such tables have been computed and are available.
(British Association Tables, 1931; Bowley, Elements of
Statistics, p. 303, $F_3(z)$ only.)

Example. (Bowley, Elements of Statistics, p. 309.) To
fit two terms of a series of Type A to data giving age
distribution of St Louis school children in the sixth grade.
(Age x means x to x+1.)

x      10    11    12    13     14    15    16    17    18      n
nf     26   201   673   1001   739   310    80    13     1    3044

By the usual routine we compute $m_1' = 13.665$, $m_2 = 1.498$,
$m_3 = 0.356$. Hence, using Sheppard's corrections, the
corrected

$s^2 = 1.498 - 0.083 = 1.415$, $s = 1.190$, $1/s = 0.840$,

estimated $a_3 = m_3/s^3 = 0.211$.
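
The preliminary moments of this example can be recomputed directly from the frequency table. The Python sketch below assumes, as the note "Age x means x to x+1" implies, that each class is centred at x + ½, and applies Sheppard's correction of 1/12 to the second moment only; the function name is illustrative. Small differences in the last figure from the values quoted are to be expected from rounding.

```python
import math

ages = [10, 11, 12, 13, 14, 15, 16, 17, 18]
freqs = [26, 201, 673, 1001, 739, 310, 80, 13, 1]

def type_a_start(values, counts, width=1.0):
    """Mean, Sheppard-corrected s and a_3 for classes [value, value + width)."""
    n = sum(counts)
    mids = [v + width / 2 for v in values]          # class mid-points
    mean = sum(f * x for f, x in zip(counts, mids)) / n
    m2 = sum(f * (x - mean) ** 2 for f, x in zip(counts, mids)) / n
    m3 = sum(f * (x - mean) ** 3 for f, x in zip(counts, mids)) / n
    s = math.sqrt(m2 - width ** 2 / 12)             # Sheppard's correction to m_2
    return mean, m2, m3, s, m3 / s ** 3

mean, m2, m3, s, a3 = type_a_start(ages, freqs)
print(round(mean, 3), round(m2, 3), round(s, 3), round(a3, 3))
```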

The rest of the working can be arranged in columns as
below.

(1)    (2)          (3)        (4)        (5)          (6)        (7)      (8)    (9)    (10)
 x   z=(x−m)/s    ½ erf z    F_3(z)    a_3F_3(z)    (3)+(5)      Δ        nΔ    obs.   normal

10      -∞        -0.5000    -0.0665    -0.0140     -0.5140    0.0079     24     26      38
11    -2.24       -0.4875    -0.0882    -0.0186     -0.5061    0.0678    206    201     208
12    -1.40       -0.4192    -0.0904    -0.0191     -0.4383    0.2202    670    673     630
13    -0.56       -0.2123    -0.0275    -0.0058     -0.2181    0.3268    995   1001     982
14     0.28        0.1103    -0.0076    -0.0016      0.1087    0.2440    743    739     786
15     1.12        0.3686    -0.0755    -0.0159      0.3527    0.1024    312    310     324
16     1.96        0.4750    -0.0942    -0.0199      0.4551    0.0264     80     80      68
17     2.80        0.4974    -0.0755    -0.0159      0.4815    0.0042     13     13       8
18     3.64        0.4999    -0.0672    -0.0142      0.4857    0.0003      1      1       0
19      ∞          0.5000    -0.0665    -0.0140      0.4860
                                                                Total    3044   3044    3044

(Entries in columns (7) to (10) refer to the class from x to x+1.)



The closeness to the observations is remarkable. Indeed 
the tests of “goodness of fit,” to be developed in 54, show 
that the discrepancies are so small as to be improbable, and 
the representation is unsatisfactory. We have here a case 
of “ over-fitting.” 

For comparison we have included in a final column the 
results given by the normal curve of best agreement. 
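The columns of the table can be regenerated by machine. The sketch below is illustrative only. It assumes that $F_3(z)$ is the integral from 0 to z of the $a_3$ term of the usual Type A series, that is $-\frac{1}{3!}\int_0^z \phi'''(t)\,dt = -\{(z^2-1)\phi(z)+\phi(0)\}/6$. That form is not stated in the surviving text, but it reproduces the tabulated column (4). The names `half_erf`, `f3` and `fitted` are hypothetical.

```python
import math

N, M, S, A3 = 3044, 13.665, 1.190, 0.211   # constants quoted in the text


def phi(z):
    """Ordinate of the standard normal curve."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def half_erf(z):
    """Normal probability integral from 0 to z (column 3 of the table)."""
    return 0.5 * math.erf(z / math.sqrt(2.0))


def f3(z):
    """Assumed form of F_3(z): -(1/3!) * integral of phi''' from 0 to z."""
    if math.isinf(z):
        return -phi(0.0) / 6.0
    return -((z * z - 1.0) * phi(z) + phi(0.0)) / 6.0


def fitted(a3):
    """Class frequencies for ages 10, ..., 18 with open end classes."""
    bounds = [-math.inf] + [(x - M) / S for x in range(11, 19)] + [math.inf]
    cum = [half_erf(z) + a3 * f3(z) for z in bounds]        # column (6)
    return [N * (hi - lo) for lo, hi in zip(cum, cum[1:])]  # column (8)


type_a = fitted(A3)   # two terms of the Type A series
normal = fitted(0.0)  # normal curve alone, as in column (10)
```

Because the sketch uses unrounded values of z, its frequencies differ from the printed columns (8) and (10) by about a unit in some classes.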


42. Representation by Poissonian Function or 
Type B. The coefficients $b_2$, $b_3$, ... in the series 35 (11)
of Type B can be expressed in terms of the factorial
moments $\mu$, $\mu_{(2)}$, $\mu_{(3)}$, .... For by 35 (10) the f.m.g.f.
is given by

$$1 + \mu\alpha + \mu_{(2)}\alpha^2/2! + \mu_{(3)}\alpha^3/3! + \dots = \exp(\mu\alpha)\,(1 + b_2\alpha^2/2! + b_3\alpha^3/3! + \dots). \tag{1}$$

Multiply each of these expressions by $\exp(-\mu\alpha)$ and
expand the product in the former case. Equating
coefficients of $\alpha^r/r!$ we have the desired relations

$$b_2 = \mu_{(2)} - \mu^2,$$
$$b_3 = \mu_{(3)} - 3\mu_{(2)}\mu + 2\mu^3,$$
$$b_4 = \mu_{(4)} - 4\mu_{(3)}\mu + 6\mu_{(2)}\mu^2 - 3\mu^4, \tag{2}$$

and so on. Note that the numerical coefficients are the 
same as occur in 14 (5). 

The procedure of fitting by Type B is therefore to 
compute factorial moments of the data by the summation 
method (Appendix 2) and by substitution in the above 
formulae to estimate the coefficients $b_2$, $b_3$, ... of the
Type B series. For the rest of the work we require the
values of $e^{-m}m^x/x!$ and its differences of as many orders
as may be necessary.

The value of $e^{-m}$ can be taken from a table (p. 147) of
the exponential function. Then $ne^{-m}$ is computed, after
which each value of $ne^{-m}m^x/x!$ can be obtained from
the preceding value, corresponding to $x-1$, by multiplying
by $m/x$, most easily done by a calculating machine. The
subsequent differencings and multiplication by coefficients
$b_2$ and so on can best be followed from the illustrative
example.

Example. E. Rutherford and H. Geiger, in 2608 experiments
(Phil. Mag., Ser. 6, 20, 1910, p. 698) on the number x
of α-particles radiated from a disc in 7.5 seconds, obtained
the distribution:

 x     0    1    2    3    4    5    6    7    8    9   10   11   12-14      n
 nf   57  203  383  525  532  408  273  139   45   27   10    4       2   2608

The summation method for factorial moments gives
$m = 3.870$, $m_{(2)} = 14.784$, whence the estimate of $b_2/2!$ is

$$\tfrac12(14.784 - 3.87^2) = -0.0965.$$
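The factorial moments and the coefficient $b_2$ can be checked by direct summation. The sketch below is illustrative. It places both observations of the pooled class 12-14 at x = 12, an assumption made here because it reproduces the quoted values of m and $m_{(2)}$. The function names are hypothetical.

```python
# Rutherford-Geiger counts; the pooled class "12-14" is placed at x = 12
# (an assumption that reproduces the factorial moments quoted in the text).
xs = list(range(13))
fs = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 2]


def factorial_moment(xs, fs, r):
    """Mean value of x(x-1)...(x-r+1) in the frequency distribution."""
    n = sum(fs)
    total = 0
    for x, f in zip(xs, fs):
        term = f
        for k in range(r):
            term *= x - k
        total += term
    return total / n


def type_b_coefficients(m1, m2, m3, m4):
    """Relations (2): b2, b3, b4 from the first four factorial moments."""
    b2 = m2 - m1 ** 2
    b3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    b4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return b2, b3, b4


m = factorial_moment(xs, fs, 1)
m_2 = factorial_moment(xs, fs, 2)
half_b2 = 0.5 * (m_2 - m ** 2)
```

With the unrounded mean the value of `half_b2` is about -0.098. The figure -0.0965 in the text comes from rounding m to 3.87 before squaring. For an exact Poisson distribution $\mu_{(r)} = \mu^r$, and relations (2) then give $b_2 = b_3 = b_4 = 0$.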


The working is set out in columns as below.

(1)    (2)        (3)         (4)         (5)            (6)       (7)     (8)
 x     nψ         n∇ψ         n∇²ψ        ½nb_2∇²ψ       (2)+(5)   obs.    Poisson

 0      54.40      54.40       54.40       -5.25            49       57       54
 1     210.52     156.12      101.72       -9.82           201      203      211
 2     407.37     196.85       40.73       -3.93           403      383      407
 3     525.49     118.12      -78.73        7.60           533      525      526*
 4     508.43     -17.06     -135.18       13.05           522      532      509*
 5     393.52    -114.91      -97.85        9.44           403      408      394
 6     253.81    -139.71      -24.80        2.39           256      273      254
 7     140.34    -113.47       26.24       -2.53           138      139      140
 8      67.89     -72.45       41.02       -3.96            64       45       68
 9      29.18     -38.71       33.74       -3.26            26       27       29
10      11.29     -17.89       20.82       -2.01             9       10       11
11       3.96      -7.33       10.56       -1.02             3        4        4
12       1.28      -2.68        4.65       -0.45             1        2        1
13       0.39      -0.89        1.79       -0.17             0        0        0

                                           Total          2608     2608     2608

N.B. (i) In the differencings in columns (3) and (4) $\psi(-1)$,
$\psi(-2)$, ... are tacitly taken as zero. (ii) The asterisked entries in
column (8) have been raised from those in column (2) to make the
totals of columns (7) and (8) both come to 2608.

It will appear when we come to consider goodness of fit
(54) that the representation by the Poisson function alone,
without the term in $b_2$, is satisfactory.
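The columns of the Type B working follow mechanically from the recurrence and the backward differences described above, and may be sketched as below. The constants n = 2608, m = 3.870 and $b_2/2! = -0.0965$ are those of the example. The function name `type_b_table` is hypothetical.

```python
import math


def type_b_table(n, m, half_b2, x_max):
    """Columns (2) to (6) of the Type B working for x = 0, ..., x_max."""
    psi = [n * math.exp(-m)]                 # n e^{-m}
    for x in range(1, x_max + 1):
        psi.append(psi[-1] * m / x)          # multiply preceding value by m/x

    def backward_diff(col):
        # Values before x = 0 are tacitly taken as zero.
        return [v - (col[i - 1] if i > 0 else 0.0) for i, v in enumerate(col)]

    d1 = backward_diff(psi)                  # column (3)
    d2 = backward_diff(d1)                   # column (4)
    corr = [half_b2 * v for v in d2]         # column (5)
    fitted = [p + c for p, c in zip(psi, corr)]  # column (6), unrounded
    return psi, d1, d2, corr, fitted


psi, d1, d2, corr, fitted = type_b_table(2608, 3.870, -0.0965, 13)
```

The entries of `fitted` are unrounded. Rounded to whole numbers they need not sum exactly to 2608, so one entry may still have to be adjusted by a unit to bring the total right.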


43. Limitations on the Use of Moments in Fitting 
Curves. The discussion of the Cauchy distribution in 
38 has shown that moments are by no means always, or 
necessarily, the best parameters to use in representing an 
observed frequency distribution by a probability distribu¬ 
tion of assigned functional form. It depends entirely on 
the nature of the probability function what parameters 
may be used with adequacy. For example, since the 
mean of any number of observations x, each of which 
obeys the same Cauchy distribution, has exactly the same 
Cauchy distribution as x, it follows that the mean of sample
in this case is no more accurate, for the purpose of estimating
the centroid of the curve, than any single observation;
indeed it may be shown that the median is much superior
for this purpose, while still better parameters can be
found. Again, for the purpose of estimating the centre
of an unknown rectangular probability distribution the
mean of the n sample observations $x_j$ is quite a good
estimate ; but surprisingly enough, as R. A. Fisher has 
shown, the mean of the two extreme observations alone is 
remarkably better. As a general precept it may be 
stated that for probability curves of shape and properties 
approximating to the normal curve the use of the mean 
and moments of the frequency distribution gives good 
estimates for those parameters in the probability distribu¬ 
tion ; but for other probability curves better parameters 
can be found. 

Example. Instructive material bearing on the rectangular
distribution may be procured from Barlow's Tables. The
last two digits of the decimal parts of the cube roots of
integers n, as given in Barlow, provide a distribution
conforming very closely to a rectangular distribution, of range
00 to 99, and centre at 49.5. We may take the 50 or 49
entries on each page, omitting the cases where n is a perfect
cube, and we may record the mean of the last two digits
over this whole sample, as well as the mean of the highest
and lowest values only. The results, for each page from
n = 100 to n = 999, will illustrate the accuracy of two
different methods of estimating from sample the mean of a
rectangular distribution.
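The experiment can be imitated without the printed tables. The sketch below rests on assumptions not stated in the text: that the cube roots are taken to seven decimal places and truncated rather than rounded, and that each page holds fifty consecutive integers. The helper names are hypothetical.

```python
def icbrt(v):
    """Largest integer k with k**3 <= v, found by bisection on integers."""
    lo, hi = 0, 1
    while hi ** 3 <= v:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** 3 <= v:
            lo = mid
        else:
            hi = mid
    return lo


def last_two_digits(n, places=7):
    """Last two digits of the cube root of n truncated to `places` decimals."""
    return icbrt(n * 10 ** (3 * places)) % 100


def page_estimates(first=100, last=999, page=50):
    """For each page: (first n, entries used, sample mean, mean of extremes)."""
    rows = []
    for start in range(first, last + 1, page):
        ns = [n for n in range(start, min(start + page, last + 1))
              if icbrt(n) ** 3 != n]               # omit perfect cubes
        digits = [last_two_digits(n) for n in ns]
        mean = sum(digits) / len(digits)
        midrange = (min(digits) + max(digits)) / 2
        rows.append((start, len(digits), mean, midrange))
    return rows


rows = page_estimates()
```

Comparing the two estimate columns of `rows` with the true centre 49.5, page by page, shows how the two estimates behave on this material.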



CHAPTER V 


PROBABILITY AND FREQUENCY IN 
TWO VARIATES 

44. Bivariate Distributions: Correlation and 

Regression. Hitherto we have been concerned 
exclusively with probability and frequency distributions 
in one variate, that is, with univariate distributions. But 
most of the important and interesting applications of 
statistics involve bivariate, trivariate or multivariate 
distributions. 

Let us consider how a typical bivariate frequency 
distribution may arise. Suppose that 1000 soldiers in a 
regiment are measured in height, x , and in weight, y. 
The measurements provide 1000 paired numbers $(x_j, y_j)$,
which may be plotted as points in a plane. The resulting
assemblage of points may be called the "dot diagram."

[Figure: dot diagram, with origin O and horizontal axis x.]

Now there may be, and in fact in the case of height
and weight there is, a tendency for the value of $y_j$ to
conform in some way to that of the corresponding $x_j$;
greater height as a rule is associated with greater weight. 
Any such tendency towards a functional relationship, 
obscured by random deviations, will manifest itself in 
the dot diagram by the greater density of the dots along 
a certain locus. This locus is not sharply outlined, but 
its estimation is important, for it is a smudged image 
of a curve which may be fine and clear-cut in the parent 
population of which the observations are a sample. This 
latent curve or functional relation y = F(x) is called a 
regression , the regression of y on x. It will be a matter of 
judgement what functional basis is chosen for its mathe¬ 
matical representation. Usually the representation is a 
linear one based on a set of prescribed functions $p_1(x)$,
$p_2(x)$, ..., the regression therefore appearing in the
form

$$y = a_0 + a_1 p_1(x) + a_2 p_2(x) + \dots, \tag{1}$$

to as many terms as are judged adequate. The statistical
problem is then to determine the best estimates of $a_0$, $a_1$,
$a_2$, ... from the n paired observations $(x_j, y_j)$. The
functions $p_i(x)$ are commonly polynomial or harmonic
functions, but they may be of any preassigned functional
type.

The diagram of dots suggests a second point of 
view. The proportion of dots in an elementary region
$x - \tfrac12\Delta x < x < x + \tfrac12\Delta x$, $y - \tfrac12\Delta y < y < y + \tfrac12\Delta y$ gives an element
of bivariate relative frequency which corresponds to a
bivariate differential element of probability, let us say
$dp = \phi(x, y)\,dx\,dy$, in the parent population.

We may imagine that on each class-rectangle of the 
network of rectangles delimited by class boundaries of 
x and y a right prism is erected, of volume proportional 
to the corresponding class frequency. The tops of these 
prisms make a surface of flat terraces which we may call 
the prismogram , the analogue in three dimensions of the 
histogram. This prismogram, then, is the rough sampling
approximation to an ideal probability surface $z = \phi(x, y)$,
which is often called the correlation surface.

The functional dependence of y on x may be investigated
either by the method of correlation, which consists in
estimating the parameters of the bivariate probability
function $\phi(x, y)$, or by the method of regression, which
consists in estimating the coefficients $a_i$ in the regression
function (1). Naturally the methods overlap to a certain 
extent. In the case of several important correlation 
(bivariate probability) functions the corresponding regres¬ 
sion curves are straight lines. 

45. Binomial and Hypergeometric Correlation. 

The natural extension of the twofold division, success and 
failure of an event E, which gives rise to the binomial or 
hypergeometric distribution in one variate, is an arrange¬ 
ment giving a twofold division in each of two events 
E and F. Such an arrangement is expressed by the 
fourfold table, as follows : 

Let the probabilities of the double events $(E, F)$, $(E, \bar F)$,
$(\bar E, F)$ and $(\bar E, \bar F)$ be $p_{11}$, $p_{10}$, $p_{01}$, $p_{00}$. These are set
out as shown in the fourfold table, the columns referring
to $E$ and $\bar E$, the rows to $F$ and $\bar F$.

              E         Ē
   F        p_11      p_01    |   p'
   F̄        p_10      p_00    |   q'
            p         q       |   1

The sum $p_{11}+p_{10}$,
representing the total probability of E whether F occurs
or not, is entered marginally as p; and in the same way
the other total probabilities q, p', q' are entered marginally
as sums of a row or of a column.






Contingency Table. The fourfold table is a simple 
example of a contingency table . The more general bivariate 
contingency table has h rows and k columns, corresponding 
to the division of one system into h categories and the 
other into k. If the probabilities $p_{ij}$ are all rational
fractions, it is possible to represent the bivariate population
by a physical model, such as one of marked or coloured
balls in due proportions.

If E and F are independent events, then $p_{11} = pp'$,
$p_{10} = pq'$, $p_{01} = qp'$ and $p_{00} = qq'$, so that $p_{11}p_{00} = p_{10}p_{01}$.
The determinant $p_{11}p_{00} - p_{10}p_{01}$ of the fourfold table is
thus zero.

Ex. 1. Prove that this determinant is equal to $p_{11} - pp'$
and to $p_{00} - qq'$.

Generating Functions. Just as in 7 we introduced 
a variable t to carry x as exponent in univariate generating 
functions, so it is natural to introduce u to carry y. The 
probability g.f. of a fourfold table will thus be 

$$G(t, u) = p_{11}tu + p_{10}t + p_{01}u + p_{00} \tag{1}$$

$$= 1 + p(t-1) + p'(u-1) + p_{11}(t-1)(u-1). \tag{2}$$

Ex. 2. Show that in the case of independence this splits
into the two factors $pt+q$, $p'u+q'$.

Now if we draw n times, with replacement each time, 
from the population characterized by the fourfold table, 
the g.f. will be 

$$(p_{11}tu + p_{10}t + p_{01}u + p_{00})^n. \tag{3}$$

The coefficient of $t^x u^y$ in the expansion of this g.f. will
be the probability $\phi(x, y)$ of having x cases E and y cases F
in the n drawings. The function $\phi(x, y)$ is the correlation
function of binomial type.

Ex. 3. If the variates x and y are independent, show that
$\phi(x, y)$ is the simple product of the binomial probability
functions

$$n_{(x)}\,p^x q^{n-x}, \qquad n_{(y)}\,(p')^y (q')^{n-y}.$$

Again, if we have a fourfold population of N individuals
with $Np_{11}$, $Np_{10}$, $Np_{01}$ and $Np_{00}$ individuals in the respective
categories, and if we sample n times without replacement,
the corresponding probability of x cases E and y cases F
is the correlation function of hypergeometric type. Even
when $p_{11}p_{00} - p_{10}p_{01} = 0$ an effect of correlation is induced
by sampling without replacement.
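The correlation function of binomial type can be evaluated directly as the coefficient of $t^x u^y$ in (3). In the sketch below, which is illustrative only, k denotes the number of drawings in which E and F occur together, and the multinomial terms are summed over k. The function name and the numerical probabilities are hypothetical.

```python
from math import comb, factorial


def binomial_correlation(x, y, n, p11, p10, p01, p00):
    """Coefficient of t^x u^y in (p11 tu + p10 t + p01 u + p00)^n."""
    total = 0.0
    for k in range(max(0, x + y - n), min(x, y) + 1):
        # k draws give (E, F); x-k give E only; y-k give F only.
        rest = n - x - y + k
        ways = factorial(n) // (
            factorial(k) * factorial(x - k) * factorial(y - k) * factorial(rest)
        )
        total += ways * p11 ** k * p10 ** (x - k) * p01 ** (y - k) * p00 ** rest
    return total


n = 5
p11, p10, p01, p00 = 0.3, 0.2, 0.1, 0.4      # illustrative values only
p, p_dash = p11 + p10, p11 + p01
grid = [(x, y) for x in range(n + 1) for y in range(n + 1)]
total_prob = sum(binomial_correlation(x, y, n, p11, p10, p01, p00)
                 for x, y in grid)
mean_xy = sum(x * y * binomial_correlation(x, y, n, p11, p10, p01, p00)
              for x, y in grid)
covariance = mean_xy - (n * p) * (n * p_dash)
```

The covariance of x and y in n drawings is $n(p_{11} - pp')$, which vanishes when the determinant of the fourfold table is zero.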

46. Bivariate Moments and Moment Generating 
Functions. The bivariate product moment of order r 
in x and s in y is defined by

$$\mu_{rs} = \sum_x\sum_y \phi(x, y)\,x^r y^s \quad\text{or}\quad \iint \phi(x, y)\,x^r y^s\,dx\,dy, \tag{1}$$

or the corresponding mean values with $\sum_x\int dy$ or $\int dx\sum_y$,
according to the discrete or continuous nature of the
variables.

There are three moments of the second order. If we
take them with respect to the means $\mu'_{10}$ and $\mu'_{01}$ of the
variates they are $\mu_{20}$ the variance of x, $\mu_{02}$ the variance
of y, and $\mu_{11}$ the product moment of x and y, often called
the covariance.

Generating Functions. The bivariate generating
function of probability is defined by

$$G(t, u) = \sum_x\sum_y \phi(x, y)\,t^x u^y \quad\text{or}\quad \iint \phi(x, y)\,t^x u^y\,dx\,dy, \tag{2}$$

or the same with $\sum_x\int dy$ or $\int dx\sum_y$.

Moment generating functions are defined by putting
$t = e^\alpha$, $u = e^\beta$, the general product moment $\mu_{rs}$ being the
coefficient of $\alpha^r\beta^s/r!\,s!$ in the resulting m.g.f.

Factorial moments can be defined by putting factorials
$x^{(r)}$ and $y^{(s)}$, as defined in 29 (2), instead of powers $x^r$
and $y^s$; and a bivariate f.m.g.f. may be constructed by
putting $t = 1+\alpha$, $u = 1+\beta$.

Example. Prove that the f.m.g.f. of the fourfold table is

$$(1 + p\alpha)(1 + p'\beta) + (p_{11}p_{00} - p_{10}p_{01})\alpha\beta.$$

47. Normal Correlation as the Limit of Binomial 
Correlation. The g.f. of a sample of n drawings, with 
replacement each time, from the fourfold population is 

$$(p_{11}tu + p_{10}t + p_{01}u + p_{00})^n = [1 + p(t-1) + p'(u-1) + p_{11}(t-1)(u-1)]^n, \tag{1}$$

by 45 (2).

Just as at the corresponding stage in 31, let us consider
the deviation of relative frequency of number of successes
from means, rather than absolute frequency. We do this
by putting $t = e^\alpha$, $u = e^\beta$ in (1) and then writing $\alpha/n$ for
$\alpha$, $\beta/n$ for $\beta$. We have then the bivariate m.g.f.

$$[1 + p\alpha/n + p\alpha^2/2n^2 + p'\beta/n + p'\beta^2/2n^2 + p_{11}\alpha\beta/n^2 + O(n^{-3})]^n$$
$$= [(1 + p\alpha/n + \tfrac12 p^2\alpha^2/n^2)(1 + p'\beta/n + \tfrac12 p'^2\beta^2/n^2)(1 + \tfrac12(p - p^2)\alpha^2/n^2 + \tfrac12(p' - p'^2)\beta^2/n^2 + (p_{11} - pp')\alpha\beta/n^2 + O(n^{-3}))]^n, \tag{2}$$

which tends asymptotically as n increases (the assumption
throughout being that none of the probabilities in the
fourfold table is $O(n^{-1})$) to

$$\exp(p\alpha + p'\beta)\,\exp \tfrac12(\sigma_1^2\alpha^2 + 2\rho\sigma_1\sigma_2\alpha\beta + \sigma_2^2\beta^2), \tag{3}$$

where $\sigma_1^2 = pq/n$, $\sigma_2^2 = p'q'/n$, $\rho\sigma_1\sigma_2 = (p_{11} - pp')/n$.

Next, just as in the case of one variate treated in 31,
and for analogous reasons, the question is to find a
continuous function $\phi(x, y)$ satisfying

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \phi(x, y)\,e^{\alpha x + \beta y}\,dx\,dy = \exp \tfrac12(\sigma_1^2\alpha^2 + 2\rho\sigma_1\sigma_2\alpha\beta + \sigma_2^2\beta^2). \tag{4}$$

The answer provided by pure mathematics is that the
only function $\phi(x, y)$ for which this is the case over a finite
domain in $\alpha$ and $\beta$ together is

$$\phi(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp[-\tfrac12 Q(x, y)],$$

$$\text{where}\quad Q(x, y) = (x^2/\sigma_1^2 - 2\rho xy/\sigma_1\sigma_2 + y^2/\sigma_2^2)/(1-\rho^2). \tag{5}$$

This function, the analogy of which with the normal
probability function in one variable is evident, is the
bivariate normal probability function or normal correlation
function. The parameter $\rho$ is called the coefficient of
correlation. The reader will verify at once that when
$\rho = 0$ the correlation function breaks up, as one might
have expected, into the product of two ordinary normal
functions, in x and y respectively.

48. Properties of the Normal Correlation Function. 

Let us suppose that units of scale in x and y are standardized
by putting $\sigma_1 = 1$, $\sigma_2 = 1$. The m.g.f. of the normal
correlation function about the means then becomes

$$\exp \tfrac12(\alpha^2 + 2\rho\alpha\beta + \beta^2), \tag{1}$$

and the coefficient of $\alpha\beta/1!\,1!$ in the expansion of this
shows that $\rho$ is the mean value of the product xy. This
suggests that in computing the parameters of bivariate
frequency distributions we should add to the usual four
parameters of first and second order, namely the means
$m'_{10}$, $m'_{01}$ and variances $s_1^2$ and $s_2^2$ of x and y, a fifth parameter,
the mean value of the product of corresponding deviations
x and y from the sample means.

The standardized value of this mean product, namely,

$$r = \frac{1}{n}\sum\sum (x - m'_{10})(y - m'_{01})/s_1 s_2, \tag{2}$$

corresponds in the sample to $\rho$ in the population or
probability function. We shall call r the Pearsonian
coefficient, or product-moment coefficient, of correlation of
x and y in the frequency distribution.

Limits of r and $\rho$. The extreme values that r and
$\rho$ can take are 1 and $-1$. They cannot lie outside those
limits. For, taking x and y as unstandardized deviations
from their means, let us consider the mean value of
$(hx+ky)^2$, or $h^2x^2 + 2hkxy + k^2y^2$, both in population and
in sample, where h and k are arbitrary real numbers. In
population the mean value is $h^2\sigma_1^2 + 2hk\rho\sigma_1\sigma_2 + k^2\sigma_2^2$, in
sample it is

$$h^2s_1^2 + 2hkrs_1s_2 + k^2s_2^2. \tag{3}$$

Now these quadratic expressions in h and k, being the
mean values of squared functions, are of necessity not
negative. But the necessary condition for this is that the
discriminants

$$(\rho\sigma_1\sigma_2)^2 - \sigma_1^2\sigma_2^2 \quad\text{and}\quad (rs_1s_2)^2 - s_1^2s_2^2 \tag{4}$$

should not be positive. Hence $\rho^2 \le 1$, $r^2 \le 1$, so that
both $\rho$ and r must lie in the range $-1$ to 1.

The result, it may be noted, depends on a property of
quadratic expressions, and therefore holds not merely for
normal but for any distribution of x and y.

Example. Prove that if x and y are uncorrelated and of
unit variance, $x\cos\theta + y\sin\theta$ and $x\sin\theta - y\cos\theta$ are also
uncorrelated and of unit variance.

In the case of independent variates, under any laws of
distribution, the product moment $\mu_{11}$ about the means is
zero. For if the separate m.g.f.'s of x and y about their
means are

$$1 + \mu_{20}\alpha^2/2! + \dots \quad\text{and}\quad 1 + \mu_{02}\beta^2/2! + \dots, \tag{5}$$

then by compound probability the m.g.f. of the two
together is

$$(1 + \mu_{20}\alpha^2/2! + \dots)(1 + \mu_{02}\beta^2/2! + \dots), \tag{6}$$

and since this has no term in $\alpha\beta$ we have $\mu_{11} = 0$.

It is most important to notice that the converse
theorem is not true. The vanishing of $\mu_{11}$ does not imply
independence. Consider for example the case when x is
distributed in any symmetrical distribution about the
mean x = 0, with variance $\sigma^2$. Then $z = x^2 - \sigma^2$ is also
distributed about a zero mean. The variates x and z
have complete functional dependence. Their product xz
however is $x^3 - \sigma^2 x$, and the mean value of this is clearly
zero. This is an extreme case, but it gives a sharp
warning against inferring the existence of independence
from a zero value of $\rho$, and still more from a zero value
of r, which is merely an estimate of $\rho$.
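The warning may be illustrated by a small exact computation. The symmetrical five-point distribution used below is a hypothetical one chosen for the purpose and is not taken from the text.

```python
from fractions import Fraction

# A symmetrical distribution of x about zero (illustrative values only).
dist = {-2: Fraction(1, 10), -1: Fraction(2, 10), 0: Fraction(4, 10),
        1: Fraction(2, 10), 2: Fraction(1, 10)}


def expect(g):
    """Mean value of g(x) in the distribution."""
    return sum(p * g(x) for x, p in dist.items())


var_x = expect(lambda x: x * x)          # the mean of x is zero


def z(x):
    return x * x - var_x                 # z = x^2 - sigma^2


mean_z = expect(z)
cov_xz = expect(lambda x: x * z(x))      # product moment of x and z

# Dependence: x = 2 forces z = z(2), yet z(2) also arises from x = -2.
p_joint = dist[2]
p_x = dist[2]
p_z = dist[2] + dist[-2]
```

Here `cov_xz` is exactly zero, while `p_joint` differs from `p_x * p_z`, so that x and z are uncorrelated but not independent.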

The normal correlation surface, when $\rho = 0$ and
variances are standardized to unity, is a symmetrical
bell-shaped surface which may be generated by the rotation
of its central vertical section, a normal curve, about the
vertical axis. When $\rho \ne 0$ the surface acquires a hog-back
ridge which lies in the first and third quadrants of
(x, y) if $\rho$ is positive, in the second and fourth quadrants
if $\rho$ is negative.

The loci of equal probability density are found
by equating $\phi(x, y)$ to a constant, yielding curves of the
form

$$x^2/\sigma_1^2 - 2\rho xy/\sigma_1\sigma_2 + y^2/\sigma_2^2 = c^2. \tag{7}$$

These are homothetic ellipses. Among them the ellipse
which includes a region in x and y of total probability ½ is
sometimes called the "probable ellipse," a name, like
"probable error" in 15, apt to mislead. This region is
the bivariate analogue of the interquartile range (15).

49. Regression Lines in Bivariate Normal Cor¬ 
relation. If we cut the normal correlation surface by a 
series of planes all perpendicular to the axis of x, the sections
are all normal curves. For each such section corresponds
to a constant value $x_k$ of x, and so the z-ordinate of such
a section is, in standard scale,

$$z = \phi(x_k, y) = c \exp[-\tfrac12(x_k^2 - 2\rho x_k y + y^2)/(1-\rho^2)] \tag{1}$$

$$= c_1 \exp[-\tfrac12(y - \rho x_k)^2/(1-\rho^2)], \tag{2}$$

where c and $c_1$ are constants; and this is the ordinate
of a normal curve with mean at $y = \rho x_k$ and of variance
$1-\rho^2$, or in unstandardized scale $\sigma_2^2(1-\rho^2)$; this variance
is the same for all such sections. The locus of the means
of such sections is therefore the straight line $y = \rho x$, or
in unstandardized units $y/\sigma_2 = \rho x/\sigma_1$. This straight line
is the regression line of y on x. There is correspondingly
a regression line $x = \rho y$, or $x/\sigma_1 = \rho y/\sigma_2$, of x on y.

The regression lines do not coincide unless $\rho = \pm 1$,
in which case (with standard units) they are the bisectors
of the angles between the x and y axes. If $\rho = 0$ the
regression lines are the axes themselves; but the concept
of regression is of little importance in this case.

Note. The name "regression" was introduced by Sir Francis
Galton (J. Anthrop. Inst., 15 (1886), p. 246). In bivariate
data concerning heights of fathers, $x_j$, and heights of eldest
sons, $y_j$, he found that the regression lines, as estimated from
the sample, were approximately $y = \tfrac12 x$, $x = \tfrac12 y$. This implies,
for example, that if there is a group of fathers whose heights
all deviate from mean height by d inches, then the average
deviation of the height of their sons from mean height is only
$\tfrac12 d$. There is thus a tendency, in the next generation, to
return or regress towards the mean. But there is no deep
and remarkable significance in this; it is a mere consequence
of the fact that neither $\rho$ nor r can numerically exceed 1,
and in practice the values $r = +1$ or $-1$ are never found.

50. Correlation Table : Computation of Product- 
Moment. A contingency table of h rows and k columns 
in which both variables x and y are metrical is called a 
correlation table. If x and y are continuous variates it 
will be convenient to take a class-unit of suitable size 
for each and thus to have class-frequencies corresponding 
to class-rectangles. For practical purposes it is advisable 
to choose these units so that each variate has ten or a 
dozen classes, not more. 

The following example illustrates the usual appearance
of a correlation table. (The distribution is of Binet
Intelligence Quotient, x, and Verbal Score, y, of 500 Scottish
schoolgirls born in 1921, tested in the first week of June
1932. The Intelligence of Scottish Children, Univ. of
London Press, 1933, p. 96.) The score named 60 means
60 and over, that is, the class 60 to 69, so that the class
marked 60 in the report should be centred at 64.5; and
so for other classes. The sums of rows and columns are
entered in the margins; they give the frequency distribution
of x when variation in y is neglected, and of y when
variation in x is neglected.



                              x (Binet I.Q.)
               60   70   80   90  100  110  120  130  140  150 |  f_y

          70                                       2           |    2
          60                        3    2    6    3    4    1 |   19
 y        50                  10   15   26   19   14    2      |   86
 (Verbal  40         2    7   32   43   23    7    2    0    1 |  117
 Score)   30         2   28   50   31   15    2    1           |  129
          20        10   32   38    6    1                     |   87
          10        11   28    4                               |   43
           0    3    7    7                                    |   17

         f_x    3   32  102  134   98   67   34   22    6    2 |  500


From the marginal distributions we can proceed to
compute the means and mean-square-deviation from
means of x and y. This will always be the first step in
computing r. The product-moment can be computed
about provisional means, and then transferred by a
correction to the true means, thus:

Since $\frac{1}{n}\sum x = m'_{10}$, $\frac{1}{n}\sum y = m'_{01}$,

the product-moment about these means is

$$\frac{1}{n}\sum\sum (x - m'_{10})(y - m'_{01}) = \frac{1}{n}\Bigl(\sum\sum xy - m'_{10}\sum y - m'_{01}\sum x\Bigr) + m'_{10}m'_{01} = \frac{1}{n}\sum\sum xy - m'_{10}m'_{01}. \tag{1}$$

Hence, just as mean-square-deviations can be computed
about a provisional mean and transferred (14) to
the true mean by subtracting a square, so mean-product-deviations
can be so transferred by subtracting the
corresponding product of deviations of provisional means.
It is to be observed that this product may be negative,
in which case the correction involves an addition.

Several different methods of computation are in use 
for finding r. We shall exemplify two, of which the rest 
are mostly variants. 

(i) The first method consists in computing $\sum\sum xy$
piecemeal according to the contributions made to this
sum by the frequencies in the rows, or alternatively in
the columns. For example, in the kth row, for $y = y_k$
constant, we compute $\sum_j f_j x_j$, that is, multiply each
class-frequency by the value of x, $x_j$, and add along the
row. For the different rows we may enter these values
$\sum f_j x_j$ in a suitable column to the right. The sum of
such values for all rows gives $\sum x$, and so may be used
to check the mean $m'_{10}$; while if we multiply each entry
in that added column by its appropriate $y_k$ and sum down
the column we have $\sum\sum xy$.

The same procedure may be carried out by columns
instead of rows. We then have a check on both means
and on $\sum\sum xy$. The whole scheme can be neatly arranged
in rows and columns annexed to the table as below. The
special value of the arrangement is perceived when it is
found necessary to compute correlation ratios (52) as well
as correlation coefficients. It simplifies the arithmetic,
too, to choose units such that the class-breadths of x
and y are both unity.

Ex. 1. By way of explanation of the entries, note
that the second entry, 63, in the Σx column comes from
3×1 + 2×2 + 6×3 + 3×4 + 4×5 + 1×6, while the second
entry, −51, in the Σy row comes from

2×1 + 2×0 + 10×(−1) + 11×(−2) + 7×(−3).

  y \ x   -3   -2   -1    0    1    2    3    4    5    6 |    f    y    yf    y²f    Σx   yΣx

    4                                          2          |    2    4     8     32     8    32
    3                          3    2    6    3    4    1 |   19    3    57    171    63   189
    2                    10   15   26   19   14    2      |   86    2   172    344   190   380
    1           2    7   32   43   23    7    2    0    1 |  117    1   117    117   113   113
    0           2   28   50   31   15    2    1           |  129    0     0      0    39     0
   -1          10   32   38    6    1                     |   87   -1   -87     87   -44    44
   -2          11   28    4                               |   43   -2   -86    172   -50   100
   -3      3    7    7                                    |   17   -3   -51    153   -30    90

    f      3   32  102  134   98   67   34   22    6    2 |  500        130   1076   289   948
    x     -3   -2   -1    0    1    2    3    4    5    6
   xf     -9  -64 -102    0   98  134  102   88   30   12 |  289
  x²f     27  128  102    0   98  268  306  352  150   72 | 1503
   Σy     -9  -51 -102    6   76   80   63   47   16    4 |  130
  xΣy     27  102  102    0   76  160  189  188   80   24 |  948

$m'_{10} = 289/500 = 0.578$, $m'_{01} = 130/500 = 0.260$,
$s_1^2 = 1503/500 - (0.578)^2 = 2.672$,
$s_2^2 = 1076/500 - (0.260)^2 = 2.084$,
product-moment $= 948/500 - (0.578)(0.260) = 1.746$.

Hence

$$s_1 = 1.635, \quad s_2 = 1.444, \quad r = 1.746/(1.635 \times 1.444) = +0.74.$$
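The arithmetic of the first method can be verified by direct summation over the cells of the table. The sketch below is illustrative only and works in the coded class units of the table above. The names `TABLE` and `product_moment_r` are hypothetical.

```python
import math

X_VALUES = list(range(-3, 7))                 # coded class values of x
TABLE = {                                     # coded y -> frequencies along x
    4:  [0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    3:  [0, 0, 0, 0, 3, 2, 6, 3, 4, 1],
    2:  [0, 0, 0, 10, 15, 26, 19, 14, 2, 0],
    1:  [0, 2, 7, 32, 43, 23, 7, 2, 0, 1],
    0:  [0, 2, 28, 50, 31, 15, 2, 1, 0, 0],
    -1: [0, 10, 32, 38, 6, 1, 0, 0, 0, 0],
    -2: [0, 11, 28, 4, 0, 0, 0, 0, 0, 0],
    -3: [3, 7, 7, 0, 0, 0, 0, 0, 0, 0],
}


def product_moment_r(table, x_values):
    """Means, variances, product-moment and r from a correlation table."""
    cells = [(x, y, f) for y, row in table.items()
             for x, f in zip(x_values, row)]
    n = sum(f for _, _, f in cells)
    m10 = sum(f * x for x, _, f in cells) / n
    m01 = sum(f * y for _, y, f in cells) / n
    s1_sq = sum(f * x * x for x, _, f in cells) / n - m10 ** 2
    s2_sq = sum(f * y * y for _, y, f in cells) / n - m01 ** 2
    # Product-moment about the origin, transferred to the true means.
    p11 = sum(f * x * y for x, y, f in cells) / n - m10 * m01
    r = p11 / math.sqrt(s1_sq * s2_sq)
    return {"n": n, "m10": m10, "m01": m01, "s1_sq": s1_sq,
            "s2_sq": s2_sq, "p11": p11, "r": r}


stats = product_moment_r(TABLE, X_VALUES)
```

Since r is unaltered by change of origin and of unit, the coded values give the same coefficient as the original scales of I.Q. and Verbal Score.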


Error of Sampling.* It is desirable here to anticipate
Chapter VII a little. The value of r is the estimate of $\rho$
obtained from one sample of n bivariate measures. Now
each possible sample will yield its own value of r, and the
ensemble of such values constitutes the sampling distribution
of r. This, when the parent population is normal,
is a skew distribution, first studied (78) by R. A. Fisher.
As n increases it tends slowly to the normal type, with
mean $\rho$ and variance $(1-\rho^2)^2/(n-1)$. Thus the standard
error (68) of r is approximately $(1-r^2)/\sqrt{n}$; but only
when n is large, say $n > 100$, and $|\rho|$ is not too great,
say $|\rho| < 0.5$, can we use normal theory. In most cases
it is better not to use standard error, but to proceed as
in 78.

* This paragraph may be postponed until Chapter VII has
been studied.

(ii) The second method of computing r depends on
the simple observation that while by summing the
frequencies in columns we obtain the distribution in x
alone, and by rows that in y alone, if we sum along
diagonals inclined at 45° to the horizontal we obtain a
distribution of $x-y$; for all class-rectangles in any such
diagonal correspond to the same value of $x-y$. Thus
from diagonal frequencies we may compute the mean of
$x-y$, namely $m'_{10} - m'_{01}$, thereby checking the individual
means as computed from row and column marginal
frequencies; and we may also compute the mean-square-deviation
of $x-y$ from its mean. Now the value of this is

$$\frac{1}{n}\sum (x-y)^2 - (m'_{10} - m'_{01})^2 = s_1^2 - 2rs_1s_2 + s_2^2.$$

But $s_1^2$ and $s_2^2$ are already known from the row and
column marginal distributions; hence r is easily found.

Ex. 2. Taking the same example as before and summing
along the diagonals, we find the frequency distribution of
$x-y$ to be

 x−y   -3   -2   -1     0     1    2    3    4    5      N
 Nf     2   22   87   173   149   57    8    1    1    500

The mean is found to be 0.318, checking the values
$m'_{10} = 0.578$, $m'_{01} = 0.260$. The mean-square-deviation from
the mean is

$$s_1^2 - 2rs_1s_2 + s_2^2 = 1.265,$$

whence

$$r = \tfrac12(2.672 + 2.084 - 1.265)/(1.635 \times 1.444) = \tfrac12 \times 3.491/2.361 = 0.74,$$

as before.

Notice that we have here no check on r. That could be
provided by summing along the other set of diagonals at
right angles to those which have been taken. They correspond
to constant values of $x+y$, and so their mean-square-deviation
from the mean is $s_1^2 + 2rs_1s_2 + s_2^2$.
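The method of diagonal sums, with the check from the second set of diagonals, may be sketched as follows. The table is the coded correlation table of the example. The function names are hypothetical and the sketch is illustrative only.

```python
import math
from collections import Counter

X_VALUES = list(range(-3, 7))                 # coded class values of x
TABLE = {                                     # coded y -> frequencies along x
    4:  [0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    3:  [0, 0, 0, 0, 3, 2, 6, 3, 4, 1],
    2:  [0, 0, 0, 10, 15, 26, 19, 14, 2, 0],
    1:  [0, 2, 7, 32, 43, 23, 7, 2, 0, 1],
    0:  [0, 2, 28, 50, 31, 15, 2, 1, 0, 0],
    -1: [0, 10, 32, 38, 6, 1, 0, 0, 0, 0],
    -2: [0, 11, 28, 4, 0, 0, 0, 0, 0, 0],
    -3: [3, 7, 7, 0, 0, 0, 0, 0, 0, 0],
}


def distribution(value_of):
    """Frequency distribution of value_of(x, y) over the whole table."""
    dist = Counter()
    for y, row in TABLE.items():
        for x, f in zip(X_VALUES, row):
            if f:
                dist[value_of(x, y)] += f
    return dict(sorted(dist.items()))


def mean_and_msd(dist):
    """Mean and mean-square-deviation from the mean."""
    n = sum(dist.values())
    mean = sum(v * f for v, f in dist.items()) / n
    msd = sum(v * v * f for v, f in dist.items()) / n - mean ** 2
    return mean, msd


_, s1_sq = mean_and_msd(distribution(lambda x, y: x))      # column margins
_, s2_sq = mean_and_msd(distribution(lambda x, y: y))      # row margins
diff = distribution(lambda x, y: x - y)                    # one set of diagonals
plus = distribution(lambda x, y: x + y)                    # the other set
mean_diff, msd_diff = mean_and_msd(diff)
_, msd_plus = mean_and_msd(plus)

two_s1s2 = 2 * math.sqrt(s1_sq * s2_sq)
r_from_diff = (s1_sq + s2_sq - msd_diff) / two_s1s2
r_from_plus = (msd_plus - s1_sq - s2_sq) / two_s1s2
```

The two values of r must agree exactly, since the mean-square-deviations of $x-y$ and $x+y$ differ by $4rs_1s_2$. A disagreement would point to an error in one of the diagonal summations.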

Ex. 3. The distribution of $x+y$, obtained by summing
along the other set of diagonals in the correlation table, is:

 x+y   -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9     N
 Nf     3   7  18  38  38  68  63  64  68  40  37  23  20   6   6   1   500

Compute r from this distribution. Notice how much more
widely spread it is than that of $x-y$ in Ex. 2.

Sheppard's Corrections. Sheppard's correction for
variance in grouped data is applicable to the mean-square-deviations
of x and y, but not to the mean-product-deviation.
On the whole, however, it is better to work
without the corrections, because the tables of Fisher's
sampling distribution of r do not take account of grouping.

51. Correlation of Variates with Poissonian Distribution.
It is not necessarily true that sampling from
a fourfold population always produces as a limiting case
a bivariate normal correlation function. Suppose, for
example, that p and p' are of order $1/n$. Then $p_{11}$ may
be of order $1/n^2$, but may also be of order $1/n$.

The f.m.g.f. of a sample of n individuals with replacement
is seen from 47 (1) to be

$$(1 + p\alpha + p'\beta + p_{11}\alpha\beta)^n. \tag{1}$$

When $p = \mu/n$ and $p' = \mu'/n$, and $p_{11}$ is $O(1/n^2)$, this
g.f. tends to $\exp(\mu\alpha + \mu'\beta)$, which shows that with
increasing n the probability function reduces to the product
of independent Poissonian functions, and is in fact

$$\psi(x, y) = e^{-\mu-\mu'}\,\frac{\mu^x \mu'^y}{x!\,y!}. \tag{2}$$

On the other hand, when $p_{11}$ is $O(1/n)$, we have

$$(1 + p\alpha + p'\beta + p_{11}\alpha\beta)^n = [(1 + p\alpha)(1 + p'\beta)(1 + \lambda n^{-1}\alpha\beta + O(n^{-2}))]^n,$$

which tends to

$$\exp(\mu\alpha + \mu'\beta + \lambda\alpha\beta), \tag{3}$$

where

$$\lambda = n(p_{11} - pp') = n(p_{11}p_{00} - p_{10}p_{01}). \tag{4}$$

Evidently $\lambda$ is the ordinary product-moment about the
means.

Now putting $\alpha = t-1$, $\beta = u-1$ in (3), we derive the
correlation function $\psi(x, y)$ as the coefficient of $t^x u^y$ in the
probability g.f.

It is found without difficulty to be (in one of its forms)

$$\psi(x, y) = e^{-(\mu+\mu'-\lambda)}\,\frac{(\mu-\lambda)^x(\mu'-\lambda)^y}{x!\,y!} \Bigl[1 + \frac{\lambda xy}{(\mu-\lambda)(\mu'-\lambda)\,1!} + \frac{\lambda^2 x(x-1)y(y-1)}{(\mu-\lambda)^2(\mu'-\lambda)^2\,2!} + \dots\Bigr], \tag{5}$$

where the polynomial in the bracket terminates after $x+1$
or $y+1$ terms, whichever is the lesser. This function is
the bivariate Poissonian function. It may be proved that
the loci of means of sections corresponding to constant
x or y are straight lines, so that here again we have linear
regression. The same property may be proved to hold
for binomial and hypergeometric correlation functions.
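The bivariate Poissonian function can be evaluated term by term from the expansion of the probability g.f. $\exp\{\mu(t-1) + \mu'(u-1) + \lambda(t-1)(u-1)\}$. The sketch below is illustrative. It sums the coefficient of $t^x u^y$ over the number k of joint occurrences, and the parameter values are hypothetical and chosen so that $\lambda$ does not exceed $\mu$ or $\mu'$.

```python
import math


def bivariate_poisson(x, y, mu, mu_dash, lam):
    """Coefficient of t^x u^y in exp(mu(t-1) + mu'(u-1) + lam(t-1)(u-1))."""
    a, b = mu - lam, mu_dash - lam
    total = 0.0
    for k in range(min(x, y) + 1):
        total += (a ** (x - k) * b ** (y - k) * lam ** k
                  / (math.factorial(x - k) * math.factorial(y - k)
                     * math.factorial(k)))
    return math.exp(-(mu + mu_dash - lam)) * total


MU, MU_DASH, LAM = 2.0, 1.5, 0.5              # illustrative values only
grid = [(x, y) for x in range(41) for y in range(41)]
probs = {(x, y): bivariate_poisson(x, y, MU, MU_DASH, LAM) for x, y in grid}
total_prob = sum(probs.values())
mean_x = sum(x * p for (x, _), p in probs.items())
mean_y = sum(y * p for (_, y), p in probs.items())
covariance = sum(x * y * p for (x, y), p in probs.items()) - mean_x * mean_y
```

The computed covariance reproduces $\lambda$, and with $\lambda = 0$ the function reduces to the product of two independent Poissonian functions.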

Both the normal function and the Poissonian correlation 
function can be derived, like the corresponding functions 
for one variate, on more general grounds than sampling 
from a fourfold table, by a compounding of elementary 
increments achieved by addition of bivariate cumulant 
g.f.’s ; but this derivation lies beyond our present scope. 


52. Non-Linear Correlation and Regression. A
linear regression between correlated variates is rather
exceptional. The loci of means of arrays usually deviate
from straightness by more than can be ascribed to random
sampling, suggesting that the underlying law of probability
cannot be either normal or Poissonian. Non-linear
regression curves are perhaps best estimated by fitting
to the data suitable regression functions by the method
of Least Squares, described in Chapter VI. In the
non-linear case, too, the coefficient $r$ or $\rho$ has marked
disadvantages (it was seen for example in 48 that $\rho$ could
be zero even when regression was perfect) and the
correlation ratio $\eta$, devised by K. Pearson, is much to be
preferred.





It was proved in 49 that in normal regression all
$y$-sections or arrays corresponding to constant $x$ had the
same variance, let us say

$$\sigma_{2.1}^2 = \sigma_2^2(1-\rho^2), \qquad (1)$$

so that

$$1-\rho^2 = \sigma_{2.1}^2/\sigma_2^2. \qquad (2)$$

Now $1-\rho^2$ may be regarded as measuring something
complementary or antithetic to correlation. The word
alienation is sometimes used to describe this quality, but
alienation suggests repulsion and is too strong a term.
Residual dispersion expresses the meaning better. In
non-linear regression the variance of the $y$-sections, namely

$$\int (y-\bar y_x)^2\,\phi(x, y)\,dy \Big/ \int \phi(x, y)\,dy, \qquad (3)$$

where

$$\bar y_x = \int y\,\phi(x, y)\,dy \Big/ \int \phi(x, y)\,dy, \qquad (4)$$

the mean of the $y$-section corresponding to constant $x$,
is not usually constant. We may, however, take the mean
of these variances of $y$-sections over all sections, that is,
over all values of $x$, obtaining

$$\sigma_{2.1}^2 = \iint (y-\bar y_x)^2\,\phi(x, y)\,dx\,dy, \qquad (5)$$

which may be regarded as the mean-square-deviation of
$y$ from its regression value $\bar y_x$, taken over the whole
distribution. Standardizing this by dividing by the total
variance of $y$, namely $\sigma_2^2$, and writing

$$1-\eta_{yx}^2 = \sigma_{2.1}^2/\sigma_2^2, \qquad (6)$$

we define a coefficient $\eta_{yx}$ analogous to $\rho$ in (2). This
coefficient $\eta_{yx}$ is the correlation-ratio of $y$ on $x$. The closer
it approaches 1, the smaller is the residual dispersion and
the closer the values $y$ lie to their regressional means.
In the same way, by interchanging the rôles of $x$ and $y$
in the above derivation, we define $\eta_{xy}$, the correlation-ratio
of $x$ on $y$. As to the signs of $\eta_{yx}$ and $\eta_{xy}$, there are
cases where these can be attributed by graphical or other
considerations, but there are also cases, for example when
the curve of regression is a periodic curve with several
oscillations, when sign has no meaning.

The estimates of $\eta_{yx}$ and $\eta_{xy}$, as derived from an actual
frequency distribution presented as a correlation table,
will be denoted by $e_{yx}$ and $e_{xy}$. We define them
analogously; thus

$$1-e_{yx}^2 = s_{2.1}^2/s_2^2, \qquad (7)$$

where $s_{2.1}^2$ is the mean, over all $y$-arrays (columns of the
correlation table), of the mean-square-deviation of $y$ from
the mean $\bar y_x$ of the column. In computing this mean of
mean-square-deviations the column frequencies, marginally
entered, serve as class frequencies. The effective
arithmetical arrangement of the computation will be given later.

That the correlation-ratio is actually a ratio, namely
the ratio of the standard deviation of the means of arrays
to the total standard deviation of the variate, will now
be proved by considering $e_{yx}$.

Lemma. If $k$ sets of $n_1, n_2, \dots, n_k$ observations,
with respective means $M_j$ and mean-square-deviations
$s_j^2$, $j = 1, 2, \dots, k$, are pooled in an aggregate of
$n = n_1+n_2+\dots+n_k$ observations with mean $M$ and
mean-square-deviation $s^2$, then

$$ns^2 = \sum_j n_j(s_j^2 + c_j^2), \qquad (8)$$

where $c_j = M - M_j$.

This follows at once from the fact that the
mean-square-deviation of the $j$th set about $M$ is $s_j^2 + c_j^2$.

Applying this lemma to the column-arrays of a
correlation table, we have

$$ns_2^2 = \sum_j n_j(s_j^2 + M_j^2), \qquad (9)$$

where $M_j$ and $s_j^2$ are the mean and mean-square-deviation
of the $j$th column. (The origin is the mean of both $x$
and $y$.) This is the same as

$$s_2^2 = s_{2.1}^2 + s_{\bar y_x}^2, \qquad (10)$$

the second term denoting the mean-square-deviation of
column means, when these are associated with column
frequencies.

The above result holds for the sample. A similar
result can be obtained for the population, integrals
replacing sums, and variance replacing mean-square-deviation
from means. The result may be put in words
thus: total variance of $y$ is equal to mean of variances
$\sigma_{2.1}^2$ of $y$-arrays plus variance $\sigma_{\bar y_x}^2$ of means of arrays. It
is another example of analysis of variance (26, 75).

Hence, by (6) and (7),

$$\eta_{yx}^2 = \sigma_{\bar y_x}^2/\sigma_2^2 \quad\text{and}\quad e_{yx}^2 = s_{\bar y_x}^2/s_2^2, \qquad (11)$$

so that $\eta_{yx}^2$ and $e_{yx}^2$ are displayed as ratios of variances
or of mean-square-deviations.
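The decomposition of the total mean-square-deviation into a part within arrays and a part between array means may be illustrated on a small table. The sketch below is illustrative only: the correlation table is hypothetical, invented for the purpose, and the helper names are assumptions rather than notation from the text.

```python
# Hypothetical correlation table: each x-column lists (y value, frequency) pairs.
table = {
    0: [(0, 3), (1, 5), (2, 2)],
    1: [(1, 2), (2, 6), (3, 2)],
    2: [(2, 1), (3, 4), (4, 5)],
    3: [(2, 4), (3, 4), (4, 2)],
}

def weighted_mean(pairs):
    """Mean of values taken with their class frequencies."""
    n = sum(f for _, f in pairs)
    return sum(v * f for v, f in pairs) / n

def mean_square_deviation(pairs):
    """Mean-square-deviation about the mean, with class frequencies as weights."""
    m = weighted_mean(pairs)
    n = sum(f for _, f in pairs)
    return sum(f * (v - m) ** 2 for v, f in pairs) / n

column_sizes = {x: sum(f for _, f in col) for x, col in table.items()}
n = sum(column_sizes.values())

# Total mean-square-deviation of y over the whole table.
all_y = [pair for col in table.values() for pair in col]
s2_total = mean_square_deviation(all_y)

# Mean, over columns, of the mean-square-deviation within each column.
s2_within = sum(column_sizes[x] * mean_square_deviation(col)
                for x, col in table.items()) / n

# Mean-square-deviation of the column means, weighted by column frequencies.
column_means = [(weighted_mean(col), column_sizes[x]) for x, col in table.items()]
s2_between = mean_square_deviation(column_means)

e2_yx = s2_between / s2_total  # squared correlation-ratio of y on x
print(s2_total, s2_within + s2_between, e2_yx)
```

The two printed totals should agree, and the squared correlation-ratio then appears either as the ratio of the between-arrays part to the total or as unity less the ratio of the within-arrays part to the total.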

53. Computation of Correlation-Ratios. The result
52 (11) permits us to compute $e_{yx}$ and $e_{xy}$ by a simple
extension of the first method of 50 for computing $r$, for
the means of rows and columns are given by the entries
in the column headed $\Sigma x$ and the row headed $\Sigma y$, divided
respectively by the frequencies $f_y$ and $f_x$. Also, the means
of these entries are $m'_{10}$ and $m'_{01}$. Hence, computing
mean-square-deviations from means in the usual way, we have

$$e_{yx}^2 = \Big[\frac{1}{n}\sum f_x(\Sigma y/f_x)^2 - m'^{\,2}_{01}\Big]\Big/ s_2^2 = \Big[\frac{1}{n}\sum (\Sigma y)^2/f_x - m'^{\,2}_{01}\Big]\Big/ s_2^2, \qquad (1)$$

and similarly for $e_{xy}^2$. We thus annex two rows $(\Sigma y)^2$,
$(\Sigma y)^2/f_x$, and two columns $(\Sigma x)^2$, $(\Sigma x)^2/f_y$, to the computation
scheme for $r$.
Example. The additional rows and columns for the
example of 50 (Binet I.Q. and Verbal Test Score) are as
follows:

    (Σy)²       81   2601  10404    36   5776   6400   3969   2209   256    16
    (Σy)²/f_x   27     81    102     0     59     96    117    100    43     8  |  633

    (Σx)²    (Σx)²/f_y
       64        32
     3969       209
    36100       420
    12769       109
     1521        12
     1936        22
     2500        58
      900        53
              -----
                915

$$e_{yx}^2 = [633/500 - (0.260)^2]/(1.444)^2 = 1.198/2.084 = 0.575,$$

$$e_{xy}^2 = [915/500 - (0.578)^2]/(1.635)^2 = 1.496/2.672 = 0.560.$$

Hence $e_{yx}$ and $e_{xy}$ are equal to 0.76 and 0.75, whereas the
value of $r$ was found to be 0.74.


54. Correlation of Non-Metrical Characters. When 
the characters in a double classification are purely qualitative,
capable of being graded by a recognizable difference
in category, but not susceptible of measurement by metrical 
scale, we must fall back on the contingency table of h x h 
rectangular cells, with corresponding cell-frequencies. 
Since variances and product-moments are now out of the 
question, the presence or absence of correlation must be 
inferred from the cell-frequencies themselves, according 
to the manner in which they deviate from presumptive 
cell-frequencies in the corresponding case of independence. 

Consider, for example, the following contingency table 
due to Galton ( Proc. Roy. Soc ., 40 (1886), p. 42), illustrating 
the incidence of eye-colour in a group of fathers and eldest 
sons. 



              E1      E2      E3      E4      p'

       F1    0.194   0.083   0.025   0.056   0.358
(i)    F2    0.070   0.124   0.034   0.036   0.264
       F3    0.041   0.041   0.055   0.043   0.180
       F4    0.030   0.036   0.023   0.109   0.198

       p     0.335   0.284   0.137   0.244   1.000


              E1      E2      E3      E4      p'

       F1    0.120   0.102   0.049   0.087   0.358
(ii)   F2    0.089   0.075   0.036   0.064   0.264
       F3    0.060   0.051   0.025   0.044   0.180
       F4    0.066   0.056   0.027   0.049   0.198

       p     0.335   0.284   0.137   0.244   1.000

The colour categories are 1, blue; 2, blue-green or grey;
3, dark grey or hazel; 4, brown. $n = 1000$.

Summing down columns we obtain frequency estimates of
the probabilities $p$ of respective eye-colours for fathers
irrespective of sons, and summing along rows, frequency
estimates of probabilities $p'$ for sons alone. These marginal
frequencies or relative frequencies may be recombined
again to form a multiplication table, which is to serve
for comparison with the original table. The marginal
frequencies in the second table are the same as in the first,
but the cell-frequencies, derived as they are by applying
the law of compound probability, represent what would
have been the state of affairs with the same marginal
frequencies had there been independence. Of course it
must be observed that if we use, as here, not the a priori
marginal probabilities but only the sample estimates given
by the marginal frequencies, this procedure is bound to
affect the sampling probability of the coefficient or criterion
of comparison, $\chi^2$.

The coefficient $\chi^2$ is a quadratic function of the
deviations of cell-frequencies in the actual from those
in the presumptive independent case; it is a kind of
composite weighted variance, with application not merely
to contingency tables but also to any comparison of actual
frequency classifications, single or multiple, with
presumptive ones. It was first employed by Lexis, but the
nature of its probability distribution was first obtained by
K. Pearson in 1900.
Distribution of $\chi^2$. The derivation of the $\chi^2$-distribution
involves the general multivariate normal
correlation function, which is outside the scope of this
short book; but the outlines may be sketched. If the
a priori probabilities in the $k$ classes are $p_1, p_2, \dots, p_k$,
then the frequencies in the classes are characterized by
the multivariate multinomial g.f.

$$(p_1t_1 + p_2t_2 + \dots + p_kt_k)^n. \qquad (1)$$

If the number of individuals found in a class is $n_j$,
the expected number being $np_j$, we may denote the class
deviation from mean value or expectation $n_j - np_j$ by $\epsilon_j$.
Then, since $\sum n_j = n$ and also $\sum np_j = n$, we must have

$$\sum \epsilon_j = 0, \qquad (2)$$

a relation in virtue of which only $k-1$ of the deviations,
let us say the first $k-1$, are independent. We therefore
put $t_k = 1$ in the g.f. and consider what happens as $n$
increases. Putting $t_j = e^{\alpha_j}$, we find that, provided no $p_j$
is very small, the multivariate m.g.f. of the class deviations
tends to

$$\exp\Big[\tfrac12 n\Big(\sum_{j=1}^{k-1} p_j(1-p_j)\alpha_j^2 - 2\sum_{i<j} p_ip_j\alpha_i\alpha_j\Big)\Big]. \qquad (3)$$

This is an m.g.f. of normal correlated probability in
the $k-1$ deviations, which on reversion gives the probability
differential of the $\epsilon_j$ as

$$c\,\exp\Big[-\tfrac12 n^{-1}\sum_{j=1}^{k}\epsilon_j^2/p_j\Big]\,d\epsilon_1\,d\epsilon_2 \dots d\epsilon_{k-1}. \qquad (4)$$

Thus the probability, or probability density, of a set $\epsilon_j$
of deviations is a function of the quadratic expression

$$\chi^2 = \sum \epsilon_j^2/np_j, \qquad (5)$$

which is Pearson's $\chi^2$. Having decided to use the
composite $\chi^2$ rather than the individual deviations as a
criterion of the nearness to expectation, we transform the
differential (4) into a differential in $\chi^2$ itself, when it
assumes the shape

$$dp = c\,\chi^{k-3}e^{-\frac12\chi^2}\,d(\chi^2), \qquad (6)$$

the probability function being of Pearson's Type III or
Gamma type.

The probability of obtaining a value of $\chi^2$ exceeding
a given $\chi_0^2$ is therefore

$$P(\chi_0^2) = \int_{\chi_0^2}^{\infty} \chi^{k-3}e^{-\frac12\chi^2}\,d(\chi^2) \Big/ \int_0^{\infty} \chi^{k-3}e^{-\frac12\chi^2}\,d(\chi^2). \qquad (7)$$

Tables of this function $P$ have been computed for various
values of $k$, the number of classes, and $\chi^2$.

Degrees of Freedom in $\chi^2$. When the class probabilities
$p_j$ are given a priori the distribution of $\chi^2$ for
$k$ classes is expressed, as we have seen, by

$$dp = c\,\chi^{k-3}e^{-\frac12\chi^2}\,d(\chi^2). \qquad (1)$$

But the presumptive class probabilities are not always
given a priori; in a contingency table, for example, they
may be estimated by recombining in multiplication the
marginal relative frequencies of the table which is being
tested. Now such a procedure forces the marginal totals
of the presumptive table of independence to agree with
those of the contingency table. This forcing reduces the
number of independent class deviations from expectation.
For example, in a 4-by-6 table there are 24 classes, of
which 23 have independent frequencies, since the total of
relative frequencies must be 1. This is in the absence
of forcing. On the other hand, if the 10 marginal totals
are preassigned, then there are only 3 x 5 or 15 independent
class frequencies, as may be seen by putting these
15 in the top left part of the table, so as to fill 3 rows
and 5 columns, and observing that all the others can then
be filled in by reference to the marginal frequencies.
In general, in an $h \times j$ table with forced marginal
agreement, there are only $(h-1)(j-1)$ independent class
frequencies.

Now, in preparing for comparison by the $\chi^2$-test in such
a case, we should not integrate $c\,\exp(-\tfrac12\chi^2)\,d\epsilon_1\,d\epsilon_2 \dots d\epsilon_{k-1}$
over all the previously independent $\epsilon_j$, for by so doing we
would be unfairly including combinations of the $\epsilon_j$ which
have been precluded by the procedure of forced agreement.
We ought to transform $\chi^2$ so that it is expressed in terms
of the restricted set of independent $\epsilon_j$. It was shown by
R. A. Fisher that when this is done the modified element
of probability is simply

$$dp = c\,\chi^{k-m-3}e^{-\frac12\chi^2}\,d(\chi^2), \qquad (2)$$

where $m$ is the number of restrictive relations, reducing
the number of independent $\epsilon_j$ from $k-1$ to $k-m-1$. It
is usual to call $k-m-1$ the number of degrees of freedom.

The table of $P(\chi^2)$ is therefore best constructed, and
consulted, with reference not to $k$, the number of classes,
but to $k-m-1$, the number of degrees of freedom; and
this applies not only to contingency tables but to all
situations in which a presumptive probability distribution
is obtained from a frequency distribution by a partial
forcing of agreement, the equating of moments for example,
involving restrictions on the deviations $\epsilon_j$. These
restrictions must be linear, that is to say, they must involve the
$\epsilon_j$ in the 1st degree only.

Since in the deduction of $P(\chi^2)$ we excluded the case
of very small class probabilities, we must exclude in
practice small class frequencies. It is customary,
therefore, in applying the test, to pool the small frequencies
at the ends of a distribution so as to make the classes
contain at least 10 individuals.

Example. The fitting of Poissonian and Type B functions
to the Rutherford-Geiger data in 42. We pool the classes
corresponding to $x = 10$ and over. Thus $k = 11$.

For the Poissonian fitting there are 9 degrees of freedom,
since the total frequency and the mean have been made to
agree in fitted curve and data. We find $\chi^2 = 12.8$, and
reference to tables shows that $P = 0.20$, a satisfactory value.

For the Type B there are 8 degrees of freedom, total
frequency, mean and second factorial moment having been
made to agree in fitted curve and data. We find $\chi^2 = 10.2$,
$P = 0.25$. The slight improvement is of little consequence;
in both cases the principal contribution to $\chi^2$ comes from
the large deviation in class $x = 8$.
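Where tables are not at hand, the probability of exceeding a given value of $\chi^2$ may be computed directly when the number of degrees of freedom is a whole number. The sketch below is an illustration under stated assumptions, not the method of the tables: the function names are invented here, and the computation uses the standard finite series for the upper tail of a Gamma-type distribution, started from the exponential (even degrees of freedom) or from the complementary error function (odd degrees of freedom).

```python
import math

def degrees_of_freedom(classes, restrictions):
    """k - m - 1, where m restrictive relations are imposed besides the total."""
    return classes - restrictions - 1

def chi2_tail(chi2, df):
    """Probability that chi-squared exceeds chi2, for whole-number df >= 1."""
    z = chi2 / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # tail for 2 degrees of freedom
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # tail for 1 degree of freedom
    # Each step raises the degrees of freedom by 2 and adds one term.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# The two fittings of the Example: k = 11 classes after pooling.
p_poisson = chi2_tail(12.8, degrees_of_freedom(11, 1))  # mean fitted
p_type_b = chi2_tail(10.2, degrees_of_freedom(11, 2))   # mean and 2nd factorial moment fitted
print(p_poisson, p_type_b)
```

The results may be set beside the tabular values quoted in the Example; some difference in the second decimal place is to be expected where a value has been read by interpolation from a compact table.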

Empirical Formula. The value of $\chi^2$ for which
$P = 0.05$ is often regarded as a boundary between the
reasonable and the dubious. This value of $\chi^2$ is given
with adequate approximation, for $k'$ degrees of freedom, by
$1.55(k'+2)$, $k' < 10$, and $1.25(k'+5)$, $k' > 10$.

For $k' = 35$ the second formula above gives the value 50,
the actual value of $\chi^2$ being 49.80. For higher values of $k'$,
$\sqrt{2\chi^2} - \sqrt{2k'-1}$ may be treated as a standard normal variate.
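The two approximations of this paragraph are easily compared. In the sketch below, offered only as an illustration, the function names are invented, the value 1.6449 is the deviate exceeded with probability 0.05 by a standard normal variate, and the use of the second empirical formula at $k' = 10$ exactly is an assumption, the text leaving that case open.

```python
import math

def empirical_5pc(k):
    """Empirical 5 per cent value of chi-squared for k degrees of freedom.

    Assumption: the second formula is used at k = 10 itself.
    """
    return 1.55 * (k + 2) if k < 10 else 1.25 * (k + 5)

def normal_approx_5pc(k, deviate=1.6449):
    """Solve sqrt(2*chi2) - sqrt(2k - 1) = deviate for chi2."""
    return 0.5 * (deviate + math.sqrt(2 * k - 1)) ** 2

for k in (5, 10, 20, 35, 60):
    print(k, empirical_5pc(k), round(normal_approx_5pc(k), 2))
```

For $k' = 35$ both values may be set beside the actual 49.80 quoted above.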

55. Coefficients of Contingency. The possibility of
dependence between variates in a contingency table can
be tested by $P(\chi^2)$. For Galton's data of eye-colours in 54
the value of $\chi^2$ is 266, a value so large that the probability
of independence of eye-colour between fathers and eldest
sons is negligibly small.
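The computation for Galton's data may be retraced as follows. This is a sketch only: it takes the relative frequencies of table (i) in 54, forms the table of independence by multiplying the marginal relative frequencies, and sums the squared deviations each divided by its presumptive value, as in 54 (5). The variable names are illustrative assumptions.

```python
# Relative frequencies of table (i) in 54: rows F1..F4, columns E1..E4.
observed = [
    [0.194, 0.083, 0.025, 0.056],
    [0.070, 0.124, 0.034, 0.036],
    [0.041, 0.041, 0.055, 0.043],
    [0.030, 0.036, 0.023, 0.109],
]
n = 1000

row_margins = [sum(row) for row in observed]        # right-hand margin p'
col_margins = [sum(col) for col in zip(*observed)]  # bottom margin p

# Table (ii): presumptive cell probabilities under independence.
independent = [[pr * pc for pc in col_margins] for pr in row_margins]

# Sum of (actual - presumptive)^2 / presumptive over all 16 cells,
# in relative frequencies; multiplying by n gives chi-squared.
per_individual = sum((o - e) ** 2 / e
                     for obs_row, ind_row in zip(observed, independent)
                     for o, e in zip(obs_row, ind_row))
chi2 = n * per_individual

degrees = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, degrees)
```

Because the presumptive cell values are here carried to full precision, whereas the printed tables are rounded to three decimals, the result may be expected to lie close to, but not exactly at, the 266 quoted above. There are 9 degrees of freedom.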

Attempts have been made to measure the strength of a
dependence by means of coefficients of contingency. Thus
$\chi^2$ measures, as it were, the dispersion of a grouped sample
from expectation, taken over all $n$ individuals; and so
the mean dispersion per individual is $\chi^2/n$, a coefficient
denoted by $\phi^2$ and called by K. Pearson the mean square
contingency. Since

$$\phi^2 = \chi^2/n = \sum (\epsilon_j/n)^2/p_j, \qquad (1)$$

it appears that $\phi^2$ is the sum of squared deviations of class
relative frequencies $n_j/n$ from the presumptive class
probabilities $p_j$, each divided by that probability $p_j$.

Pearson, considering the value of $\phi^2$ for a bivariate
normal correlated distribution divided into grades of
indefinite fineness in $x$ and $y$, found the relation

$$\phi^2 = \rho^2/(1-\rho^2), \qquad \rho^2 = \phi^2/(1+\phi^2), \qquad (2)$$

and, proceeding by analogy, defined a general coefficient of
mean square contingency $C$ by

$$C^2 = \phi^2/(1+\phi^2). \qquad (3)$$

Evidently $C^2$ is zero when $\phi^2$ is zero and tends to 1 as $\phi^2$
increases; but its interpretation for intermediate values is
not very definite.
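The analogy on which $C$ rests can be seen in a few lines. The sketch below is illustrative, with invented function names: it confirms that, when $\phi^2$ has the value belonging to a finely graded normal distribution of correlation $\rho$, the coefficient $C$ returns $\rho$ itself.

```python
import math

def phi2_normal(rho):
    """Mean square contingency of a finely graded bivariate normal distribution."""
    return rho ** 2 / (1 - rho ** 2)

def contingency_coefficient(phi2):
    """Coefficient of mean square contingency C."""
    return math.sqrt(phi2 / (1 + phi2))

for rho in (0.2, 0.6, 0.9):
    print(rho, contingency_coefficient(phi2_normal(rho)))

# The mean square contingency 0.266 of Galton's data (chi-squared 266, n = 1000).
print(contingency_coefficient(0.266))
```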

Example. The computation of $\phi^2$ and $C^2$ for Galton's
data in 54.

The table of values $(p_{ij} - p_ip'_j)^2/p_ip'_j$ is:

              E1      E2      E3      E4

       F1    0.046   0.004   0.012   0.011   0.073
       F2    0.004   0.032   0.000   0.012   0.048
       F3    0.006   0.002   0.036   0.000   0.044
       F4    0.020   0.007   0.001   0.073   0.101

             0.076   0.045   0.049   0.096   0.266

Thus $\phi^2 = 0.266$, $C^2 = 0.266/1.266 = 0.210$, $C = 0.46$.

Table of $P(\chi^2)$. A table of $P(\chi^2)$, arranged in a
compact and practical form, is given in Table III of
R. A. Fisher's Statistical Methods for Research Workers,
8th edition, pp. 110-111; also in the Statistical Tables for
Biological, Agricultural and Medical Research of Fisher and
Yates (Oliver and Boyd, 1943), p. 27.

For practice in the $\chi^2$-test, the reader may examine
whether the experimental data of the examples on pp. 49
and 50 are in good accord with the theoretical distributions,
rectangular and binomial, there given.
THE METHOD OF LEAST SQUARES: MULTIVARIATE
CORRELATION : POLYNOMIAL AND HARMONIC 
REGRESSION 

56. Multivariate Regression. When distributions in 
more than two correlated variates are encountered, an 
important question is the determination of the optimal 
value (sometimes in the sense of mean value, sometimes 
in the sense of most probable value) of a particular variate 
in terms of the values of all or any given set of the other 
variates. We have seen that in normal bivariate distributions
the loci of such optimal values are straight
regression lines. In normal correlation of many variates 
the corresponding loci are still linear, expressed by 
equations of the first degree. For three variates there 
are three planes of regression, for n variates there is a 
sheaf of n hyperplanes, each given by a linear equation 
expressing a particular variate in terms of the other n —1 
variates. 

It was proved by Yule that these various linear loci 
could be obtained without the assumption of normal 
distribution by using the method of Least Squares, which 
we now describe. 

57. The Method of Least Squares. The method of
Least Squares originated in the practical necessity of
combining discrepant observations of a single unknown 
constant, or discrepant observational equations in several 
unknowns, in such a way as to obtain best estimates of 
the unknown or unknowns, under some accepted criterion. 

Discrepant measures are inevitable in repeated observations,
even when every effort has been made to keep
conditions constant. The conditions can never be identically
realized a second time. However delicate the
instrument of measurement, there are innumerable fine 
and uncontrollable variations inherent in its parts and 
their adjustment and the readings, to say nothing of the 
inaccuracies of the observer. Hence, just as in the 
throwings (4) of a coin, we have varying phases of a 
system S . Thus repeated measures of a supposedly 
unique physical constant are found to be discordant, 
the truth being that they are a sample from a certain 
probability distribution depending on S. In the same 
way, when linear combinations or other functions of 
several unknowns are measured, the number of observations 
exceeding the number of unknowns, the equations so 
derived are nearly always found to be inconsistent. 

In 1805 Legendre proposed, as a convenient method 
for reducing certain astronomical observations, that the 
“ best value ” should be taken as that for which the sum 
of squared deviations of the observations was least. This 
is the principle of Least Squares. It can be justified under 
the assumptions (i) that the measures are normally 
distributed and (ii) that the best value has maximum
probability density. This derivation is mathematically the
simplest and most rapid, but it unduly limits the types
of error distribution. A more comprehensive derivation
postulates that the best value is (i) a consistent or unbiassed
linear combination of the observations and (ii) has minimum
variance. It is remarkable that the two quite different
sets of postulates lead to exactly the same equations for 
the unknown or unknowns. 

58. Precision, Weight, Errors and Residuals.
Measuring instruments of differing precision may be
characterized by their standard error, or variance of error,
in the reading given by them of some assigned measure.
The variance may be estimated by repeated trials. It is
traditional here to use the term weight, defined as proportional
to the reciprocal of variance of error. For example,
if in determining a distance of 5000 yards the standard
error of a range-finder A is estimated to be half that of
a range-finder B, the weights $w_A$ and $w_B$ assigned to
readings made by A and B would be as 4 to 1, in favour of A.

Finally, it must always be kept in mind that “ true ” 
values (if indeed the word “ true ” admits at all of definite 
meaning) are unknown and must remain unknown; so 
that the errors, being deviations from an unknown value, 
are likewise unknown. True values must be estimated by 
appropriate substitutes, namely, best or optimal values, 
and errors by the deviations of the observed from the 
optimal values. These deviations are distinguished from 
the errors which they represent by being called residuals.
Errors are $\epsilon_j = a_j - a$, residuals are $e_j = a_j - \hat a$, where $a$
is the true value, $a_j$ an observed value of $a$, and $\hat a$ the
optimal value of $a$. If there are $n$ observations, the $n$
residuals are estimates of the n errors ; and the n errors 
are themselves only a finite selection under the law of 
probability which characterizes the circumstances of 
measurement. 

59. Repeated Measurements of a Single Unknown.
The estimate by Least Squares is found by minimizing
the sum of weighted squares of residuals. The minimum of

$$S^2 = \sum_j w_j(x_j - \hat x)^2 \qquad (1)$$

is given by $dS^2/d\hat x = 0$, so that

$$\hat x = \sum w_jx_j \Big/ \sum w_j. \qquad (2)$$

The optimal value of $x$ thus appears as a weighted mean
of the observations. If the observations are all of
equal weight the optimal value is thus the arithmetic
mean.
Variance of Optimal Value. The variance of $\hat x$ in
(2) is (15)

$$\sum w_j^2\sigma_j^2 \Big/ \Big(\sum w_j\Big)^2 = \Big(\sum w_j^2\,\sigma^2/w_j\Big) \Big/ \Big(\sum w_j\Big)^2 = \sigma^2 \Big/ \sum w_j, \qquad (3)$$

where $\sigma_j^2$ is the variance of $x_j$ and $\sigma^2$ is the variance of
an observation of unit weight. Thus the weight of $\hat x$ is
$\sum w_j$, the sum of the weights of the $x_j$. In particular
the weight of the arithmetic mean of $n$ values $x_j$, all of equal
weight, is $n$ times the weight of any $x_j$.
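The weighted mean and its variance may be illustrated with the two range-finders of 58, whose weights are as 4 to 1. The sketch below is illustrative only: the two readings and the unit variance are hypothetical figures invented for the purpose, and the function names are assumptions.

```python
def weighted_mean(values, weights):
    """Optimal value: sum of w*x divided by sum of w."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def weighted_sum_of_squares(values, weights, estimate):
    """S^2, the sum of weighted squares of residuals about an estimate."""
    return sum(w * (x - estimate) ** 2 for x, w in zip(values, weights))

def variance_of_weighted_mean(unit_variance, weights):
    """Variance of the optimal value: unit variance divided by total weight."""
    return unit_variance / sum(weights)

# Hypothetical readings in yards by range-finders A (weight 4) and B (weight 1).
readings = [5012.0, 4990.0]
weights = [4, 1]

x_hat = weighted_mean(readings, weights)
s2_min = weighted_sum_of_squares(readings, weights, x_hat)

# Hypothetical variance of a reading of unit weight (that of B), in square yards.
var_x_hat = variance_of_weighted_mean(400.0, weights)
print(x_hat, s2_min, var_x_hat)
```

Any displacement of the estimate away from the weighted mean increases $S^2$, which may be verified by evaluating `weighted_sum_of_squares` at neighbouring values.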

Variance of Residuals in Case of Equal Weight.
If the observations are all of unit weight the $j$th residual
$e_j$ is

$$x_j - \hat x = (n-1)x_j/n - (x_1 + x_2 + \dots + x_n - x_j)/n. \qquad (4)$$

Thus the variance of $e_j$ is (15)

$$(n-1)^2\sigma^2/n^2 + (n-1)\sigma^2/n^2 = (n-1)\sigma^2/n. \qquad (5)$$

It follows that an estimate of $\sigma^2$ is given by dividing
the sum of squared residuals not by $n$ but by $n-1$.

Ex. 1. The author made 30 bisections by eye of lines of
constant length. The distribution of $x$, the length in cm. of
the segment to the left of the point of bisection, was:

    x   7.6  7.65  7.75  7.8  7.85  7.9  8.0  8.1  8.15  8.2  8.25  8.45  |  n
    f    2    3     1     4    4     2    4    2    3     2    2     1    |  30

Estimate the length of the half line and the standard error.

Ex. 2. Do the same for the results given by a second
person:

    x   7.7  7.75  7.8  7.85  7.9  7.95  8.0  8.05  8.1  8.15  8.2  8.3  |  n
    f    1    1     1    4     3    5     4    5     3    1     1    1   |  30

Ex. 3. Compare the precision of the two persons by
assigning weights. By a weighted combination estimate the
length of half the line from all 60 bisections, and assign a
standard error. (The length of the line was actually 16 cm.)
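One way of setting out the arithmetic of these three exercises is sketched below. It is offered as a possible arrangement and not as the solution intended by the author: in particular, the choice of weighting each person's mean by the reciprocal of its estimated variance is an assumption about how the weights of Ex. 3 are to be assigned.

```python
import math

lengths_1 = [7.6, 7.65, 7.75, 7.8, 7.85, 7.9, 8.0, 8.1, 8.15, 8.2, 8.25, 8.45]
counts_1 = [2, 3, 1, 4, 4, 2, 4, 2, 3, 2, 2, 1]
lengths_2 = [7.7, 7.75, 7.8, 7.85, 7.9, 7.95, 8.0, 8.05, 8.1, 8.15, 8.2, 8.3]
counts_2 = [1, 1, 1, 4, 3, 5, 4, 5, 3, 1, 1, 1]

def summarize(values, counts):
    """Return n, the mean, and the variance estimated with divisor n - 1."""
    n = sum(counts)
    mean = sum(f * x for x, f in zip(values, counts)) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(values, counts))
    return n, mean, sum_sq / (n - 1)

n1, mean_1, var_1 = summarize(lengths_1, counts_1)
n2, mean_2, var_2 = summarize(lengths_2, counts_2)

# Standard error of each mean: sqrt(variance of one bisection / n).
se_1 = math.sqrt(var_1 / n1)
se_2 = math.sqrt(var_2 / n2)

# Assumed weighting: each mean weighted by the reciprocal of its variance.
w1, w2 = 1 / se_1 ** 2, 1 / se_2 ** 2
combined = (w1 * mean_1 + w2 * mean_2) / (w1 + w2)
combined_se = math.sqrt(1 / (w1 + w2))

print(mean_1, se_1)
print(mean_2, se_2)
print(combined, combined_se)
```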

60. Indirect Determinations from Linear Equations.
In this case we have measurements of $n$ linear
functions of $m$ unknowns, where $n$ exceeds $m$. Because
of observational error the equations are inconsistent. For
example, we might have

    Observations.          Weights.

    x         = 1.75          2
    x+y       = 3.10          1
    x+y+z     = 3.85          4
      y+z+u   = 4.30          2
        z+u   = 3.05          3
          u   = 2.10          1           (1)

In such a case the method of Least Squares consists
again in taking as optimal values those for which the
sum of weighted squares of residuals is a minimum, so that
for example to solve the equations (1) we would minimize

$$S^2 = 2(x-1.75)^2 + (x+y-3.10)^2 + 4(x+y+z-3.85)^2 + 2(y+z+u-4.30)^2 + 3(z+u-3.05)^2 + (u-2.10)^2 \qquad (2)$$


with respect to $x, y, z, u$. More generally, if the equations
are (to take the case of 4 unknowns)

    a_1 x + b_1 y + c_1 z + d_1 u = h_1,   weight w_1,
    a_2 x + b_2 y + c_2 z + d_2 u = h_2,    ...   w_2,
    . . . . . . . . . . . . . . . . . . . . . . . . .
    a_n x + b_n y + c_n z + d_n u = h_n,    ...   w_n,

we minimize

$$S^2 = \sum_{j=1}^{n} w_j(a_jx + b_jy + c_jz + d_ju - h_j)^2, \qquad (3)$$

and similarly for any number of unknowns.

The partial derivatives $\partial S^2/\partial x$, $\partial S^2/\partial y$, $\partial S^2/\partial z$, $\partial S^2/\partial u$
must be zero; and so we derive the equations

$$\begin{aligned}
(\textstyle\sum w_ja_j^2)x + (\sum w_ja_jb_j)y + (\sum w_ja_jc_j)z + (\sum w_ja_jd_j)u &= \textstyle\sum w_ja_jh_j,\\
(\textstyle\sum w_ja_jb_j)x + (\sum w_jb_j^2)y + (\sum w_jb_jc_j)z + (\sum w_jb_jd_j)u &= \textstyle\sum w_jb_jh_j,\\
(\textstyle\sum w_ja_jc_j)x + (\sum w_jb_jc_j)y + (\sum w_jc_j^2)z + (\sum w_jc_jd_j)u &= \textstyle\sum w_jc_jh_j,\\
(\textstyle\sum w_ja_jd_j)x + (\sum w_jb_jd_j)y + (\sum w_jc_jd_j)z + (\sum w_jd_j^2)u &= \textstyle\sum w_jd_jh_j,
\end{aligned} \qquad (4)$$

for $x, y, z, u$. These are called the normal equations, and
their general form is similar to the above. Inspection will
show that the coefficients in the normal equations are
symmetrical, in the respect that the coefficient of the $j$th
unknown in the $k$th equation is identical with that of the
$k$th unknown in the $j$th equation. The scheme of coefficients
is in fact symmetrical about its northwest-to-southeast
diagonal. This symmetry is of great service in shortening
the solution of the equations.

Thus in our numerical example the normal equations will
be found to be

    7x + 5y + 4z      = 22.000,
    5x + 7y + 6z + 2u = 27.100,
    4x + 6y + 9z + 5u = 33.150,           (5)
         2y + 5z + 6u = 19.850,

which can now be solved by methods of practical algebra.
The solutions are $x = 1.750$, $y = 1.274$, $z = 0.846$, $u = 2.178$.

Various schemes of systematic solution of normal 
equations have been devised, and for these the reader 
must be referred to more comprehensive treatises and 
original memoirs dealing with Least Squares or with the 
numerical solution of algebraic equations. 

Preparation of Normal Equations. It is evident
from the construction of the sum $S^2$ of weighted and
squared residuals that exactly the same sum would arise
if we multiplied each observation throughout by the
square root of its weight, $\sqrt{w_j}$, and then regarded the
observational equations as of equal unit weight. (Let the
reader verify this from the example.) Such a reduction
of a set of equations with unequal weights to a set with
equal weights is called preparing the equations.
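The formation and solution of the normal equations for the numerical example of this section, and the verification invited in the last paragraph, may be sketched as follows. The arrangement is illustrative: the function names are invented, and plain elimination with pivoting is used merely as one convenient method of practical algebra, not as a scheme recommended in the text.

```python
import math

# Observational equations 60 (1): coefficients of (x, y, z, u), observed value, weight.
equations = [
    ([1, 0, 0, 0], 1.75, 2),
    ([1, 1, 0, 0], 3.10, 1),
    ([1, 1, 1, 0], 3.85, 4),
    ([0, 1, 1, 1], 4.30, 2),
    ([0, 0, 1, 1], 3.05, 3),
    ([0, 0, 0, 1], 2.10, 1),
]

def normal_equations(eqs):
    """Coefficient scheme and right-hand sides of the normal equations."""
    m = len(eqs[0][0])
    matrix = [[sum(w * a[i] * a[j] for a, _, w in eqs) for j in range(m)]
              for i in range(m)]
    rhs = [sum(w * a[i] * h for a, h, w in eqs) for i in range(m)]
    return matrix, rhs

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting, then back-substitution."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def prepare(eqs):
    """Multiply each equation by the square root of its weight; weight becomes 1."""
    return [([math.sqrt(w) * c for c in a], math.sqrt(w) * h, 1) for a, h, w in eqs]

matrix, rhs = normal_equations(equations)
solution = solve(matrix, rhs)
prepared_matrix, prepared_rhs = normal_equations(prepare(equations))

print(matrix, rhs)
print([round(v, 3) for v in solution])
```

The coefficient scheme obtained is symmetrical about its diagonal, and the prepared equations of unit weight lead to the same normal equations as the weighted ones.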

61. Application of Least Squares to Trivariate
Correlation. Suppose that we have $n$ trivariate observations
$(x_j, y_j, z_j)$, as for example the height, weight and
chest measurement of each of 1000 soldiers, and that we
wish to express each variate as the best possible linear
estimate of the other two. We may suppose the variates
measured as deviations from their respective means, and
standardized. Thus for $x$ we have $n$ equations

$$x_j = b_{12}y_j + b_{13}z_j, \qquad (j = 1, 2, \dots, n). \qquad (1)$$

These may be regarded as $n$ observational equations in
the two unknowns $b_{12}$ and $b_{13}$. If we solve them by least
squares we shall have the desired optimal relation

$$x = b_{12}y + b_{13}z, \qquad (2)$$

which may be regarded geometrically as the regression
plane of $x$ on $y$ and $z$. The coefficients $b_{12}$ and $b_{13}$ are
called regression coefficients; they are the sample estimates
of ideal regression coefficients $\beta_{12}$, $\beta_{13}$ in the underlying
population. The normal equations for $b_{12}$ and $b_{13}$ are
obtained by minimizing the sum of squared residuals

$$S^2 = \sum_j (x_j - b_{12}y_j - b_{13}z_j)^2. \qquad (3)$$

The minimum conditions $\partial S^2/\partial b_{12} = 0$, $\partial S^2/\partial b_{13} = 0$
give, on division by $n$,

$$\begin{aligned} b_{12} + r_{23}b_{13} &= r_{12},\\ r_{23}b_{12} + b_{13} &= r_{13}, \end{aligned} \qquad (4)$$

where $r_{12} = \sum x_jy_j/n$, $r_{13} = \sum x_jz_j/n$, $r_{23} = \sum y_jz_j/n$.

Solving, we find the desired regression coefficients as

$$\begin{aligned} b_{12} &= (r_{12} - r_{13}r_{23})/(1 - r_{23}^2),\\ b_{13} &= (r_{13} - r_{12}r_{23})/(1 - r_{23}^2), \end{aligned} \qquad (5)$$

and similar results hold for the regression of $y$ on $x$ and $z$,
and of $z$ on $x$ and $y$.

The standardized mean-product-deviations $r_{12}$, $r_{13}$ and
$r_{23}$ are usually called total correlation coefficients of $x$ and
$y$, $x$ and $z$ and $y$ and $z$ respectively. They are really
estimates from sample of the corresponding mean-product-deviations,
or product-moments $\rho_{12}$, $\rho_{13}$ and $\rho_{23}$ in the
trivariate population or probability function.
It may be proved that the trivariate normal m.g.f.,
in standard scale and with means as origin, is

$$\exp \tfrac12(\alpha^2 + \beta^2 + \gamma^2 + 2\rho_{12}\alpha\beta + 2\rho_{13}\alpha\gamma + 2\rho_{23}\beta\gamma), \qquad (6)$$

and by reversion that the corresponding trivariate normal
function is

$$\begin{aligned} \phi(x, y, z) = (2\pi)^{-\frac32}\Delta^{-\frac12}\exp\big[-\tfrac12\Delta^{-1}\{&(1-\rho_{23}^2)x^2 + (1-\rho_{13}^2)y^2 + (1-\rho_{12}^2)z^2\\ &- 2(\rho_{12} - \rho_{13}\rho_{23})xy - 2(\rho_{13} - \rho_{12}\rho_{23})xz - 2(\rho_{23} - \rho_{12}\rho_{13})yz\}\big], \end{aligned} \qquad (7)$$

where $\Delta$ is the determinant

$$\Delta = \begin{vmatrix} 1 & \rho_{12} & \rho_{13}\\ \rho_{12} & 1 & \rho_{23}\\ \rho_{13} & \rho_{23} & 1 \end{vmatrix}$$

of total correlations.

The equations $\partial\phi/\partial x = 0$, $\partial\phi/\partial y = 0$, $\partial\phi/\partial z = 0$ give
the loci of maximum probability of $x$ for fixed $y$ and $z$,
of $y$ for fixed $x$ and $z$, and of $z$ for fixed $x$ and $y$. By
actual differentiation we find these loci to be

$$x = \beta_{12}y + \beta_{13}z \qquad (8)$$

and two others, where

$$\begin{aligned} \beta_{12} &= (\rho_{12} - \rho_{13}\rho_{23})/(1 - \rho_{23}^2),\\ \beta_{13} &= (\rho_{13} - \rho_{12}\rho_{23})/(1 - \rho_{23}^2). \end{aligned} \qquad (9)$$

Thus we see that the estimates of regression by Least
Squares are in agreement with those based on normal
trivariate correlation. A corresponding result is true for
linear regression in any number of variates.

62. Partial Correlation. The unstandardized equations,
with means as origin, of the regression lines in
bivariate regression (49) are

$$\begin{aligned} x &= \beta_{12}y, \quad\text{where } \beta_{12} = \rho\sigma_1/\sigma_2,\\ y &= \beta_{21}x, \quad\text{where } \beta_{21} = \rho\sigma_2/\sigma_1. \end{aligned} \qquad (1)$$

The correlation coefficient $\rho$ appears here as the
geometric mean $(\beta_{12}\beta_{21})^{\frac12}$. On analogy of this, partial
correlation coefficients in multivariate problems have been
defined as the geometric means of the corresponding
regression coefficients. For example, the partial coefficient
of two variables $x_h$ and $x_k$ would be defined by $(\beta_{hk}\beta_{kh})^{\frac12}$.

Notation. It is customary to denote, for example in
a four-variate problem, the partial correlation coefficient
of $x$ and $y$ by $\rho_{12.34}$, to distinguish it from the total
correlation coefficient $\rho_{12}$. The sample estimate would be
written $r_{12.34}$.

Example. Given the following estimates of variances and
total correlations of three variables $x, y, z$, find the three
regression equations and the three estimates of partial
correlation coefficients:

$\sigma_1^2 = 5.0$, $\sigma_2^2 = 7.0$, $\sigma_3^2 = 3.0$, $r_{12} = 0.80$, $r_{13} = 0.40$,
$r_{23} = 0.60$.
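A possible arrangement of the computation for this Example is sketched below. It rests on two assumptions made for illustration: that the regression coefficients of 61 (5), found in standard scale, are restored to the original scales by the ratio of standard deviations, as in (1) above; and that the sign of each partial coefficient is taken from the regression coefficients, the geometric mean giving only its magnitude. The function names are invented.

```python
import math

variances = {1: 5.0, 2: 7.0, 3: 3.0}
total_r = {(1, 2): 0.80, (1, 3): 0.40, (2, 3): 0.60}

def corr(i, j):
    """Total correlation of variates i and j, in either order."""
    return total_r[(min(i, j), max(i, j))]

def standardized_coefficient(i, j, k):
    """Coefficient of variate j in the regression of i on j and k, as in 61 (5)."""
    return (corr(i, j) - corr(i, k) * corr(j, k)) / (1 - corr(j, k) ** 2)

def raw_coefficient(i, j, k):
    """The same coefficient with the variates restored to their own scales."""
    return standardized_coefficient(i, j, k) * math.sqrt(variances[i] / variances[j])

def partial_correlation(i, j, k):
    """Signed geometric mean of the two regression coefficients linking i and j."""
    b_ij = standardized_coefficient(i, j, k)
    b_ji = standardized_coefficient(j, i, k)
    return math.copysign(math.sqrt(b_ij * b_ji), b_ij)

# Regression equations, variates measured from their means:
# variate i = (first coefficient) * variate j + (second coefficient) * variate k.
for i, j, k in [(1, 2, 3), (2, 1, 3), (3, 1, 2)]:
    print(i, (j, raw_coefficient(i, j, k)), (k, raw_coefficient(i, k, j)))

r12_3 = partial_correlation(1, 2, 3)
r13_2 = partial_correlation(1, 3, 2)
r23_1 = partial_correlation(2, 3, 1)
print(r12_3, r13_2, r23_1)
```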

63. Non-Linear Regression: Polynomial Regression.
From the nature of a set of observations of a
variate $y$ dependent on $x$ it may be apparent that the
regression cannot be linear. Common types of non-linear
regression are those in which the underlying functional
relation of $y$ and $x$ is of polynomial, or of harmonic type.

The polynomial regression

$$y = c_0 + c_1x + c_2x^2 + \dots + c_kx^k \qquad (1)$$

will be considered first in its simplest case, the fitting
of the polynomial by Least Squares to $n$ independent
observations $u_x$ of equal weight, corresponding to $n$
equispaced values of $x$, namely $x = 0, 1, \dots, n-1$.

The polynomial of best fit is given by the minimum
of the sum of squared residuals

$$S^2 = \sum_x (u_x - c_0 - c_1x - c_2x^2 - \dots - c_kx^k)^2, \qquad (2)$$

that is, by the conditions $\partial S^2/\partial c_j = 0$. These give $k+1$
normal equations for the $c_j$, easily seen to be expressible as

$$\sum_x x^j(u_x - y_x) = 0, \qquad (j = 0, 1, 2, \dots, k), \qquad (3)$$

displaying the fact that the fitting of a polynomial of degree
$k$ by Least Squares is equivalent to equating the moments
of orders $0, 1, 2, \dots, k$ of the polynomial and the data.

The values of the coefficients $c_j$ can be found by solving
the normal equations, but since sums of powers of natural
numbers up to the 2k-th are required, the method becomes
laborious if n is large and if the polynomial y is of the
3rd or higher degree. For this reason it is better to
express y not in powers of x, but in polynomials
$1, t_1(x), t_2(x), \dots, t_k(x)$ having the property of being
uncorrelated, being in fact such that the product sum

    $\sum_x t_i(x)\,t_j(x) = 0$ if $i \ne j$.   (4)

These polynomials $t_j(x)$ are familiar in mathematics as
the orthogonal polynomials of Tchebychef, and their
properties are known. For example, it is known that

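    $t_j(x) = (2j)_{(j)}\,x_{(j)} - (2j-1)_{(j-1)}(n-j)_{(1)}\,x_{(j-1)}$
    $\qquad + (2j-2)_{(j-2)}(n-j+1)_{(2)}\,x_{(j-2)} - \dots + (-)^j (n-1)_{(j)}$,   (5)

in which $a_{(r)}$ denotes the reduced factorial $a(a-1)\cdots(a-r+1)/r!$,
so that (Appendix, 1) the s-th difference

    $\Delta^s t_j(x) = (2j)_{(j)}\,x_{(j-s)} - (2j-1)_{(j-1)}(n-j)_{(1)}\,x_{(j-s-1)}$
    $\qquad + \dots + (-)^{j-s}(j+s)_{(s)}(n-s-1)_{(j-s)}$.   (6)

It is also known that

    $\sum_x (t_j(x))^2 = n(n^2-1)(n^2-4)\cdots(n^2-j^2)/[(2j+1)(j!)^2]$.   (7)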

X 

If, therefore, we express y in the form

    $y = a_0 + a_1 t_1(x) + a_2 t_2(x) + \dots + a_k t_k(x)$,   (8)

the sum $S^2$ of squared residuals, because of the vanishing
of the product terms, takes the form

    $S^2 = \sum_x [u_x - a_0 - a_1 t_1(x) - \dots - a_k t_k(x)]^2$
    $\quad = \sum_x [u_x^2 - 2u_x(a_0 + a_1 t_1(x) + \dots + a_k t_k(x))$
    $\qquad + a_0^2 + (a_1 t_1(x))^2 + \dots + (a_k t_k(x))^2]$   (9)

and the normal equations $\partial S^2/\partial a_j = 0$ therefore take the
form

    $a_j = \sum_x u_x t_j(x) \,/\, \sum_x (t_j(x))^2, \quad (j = 0, 1, \dots, k)$.   (10)

Thus each coefficient in the regression is found independently
of the others, without the labour of solving
simultaneous equations. (The choice of uncorrelated or
orthogonal functions for the representation of $u_x$ always
confers this very great advantage.) Since the polynomials
$t_j(x)$ are expressed in factorials $x^{(r)}$, the numerator of the
expression for $a_j$ can easily be found in terms of the
factorial moments of the data $u_x$, these moments being
obtained as usual (Appendix, 2) by summation.

The minimum sum of squared residuals can itself be
evaluated beforehand, for by (9) and (10) it takes the form

    $\sum_x [u_x^2 - a_0^2 - (a_1 t_1(x))^2 - \dots - (a_k t_k(x))^2]$
    $\quad = \sum u_x^2 - a_0 \sum u_x - a_1 \sum u_x t_1(x) - \dots - a_k \sum u_x t_k(x)$,   (11)

involving the sum of the squares of the $u_x$, diminished
by the product of each successive $a_j$ by the numerator
in (10). It is known that the variance of a single residual
is best estimated by dividing the sum of the n squared
residuals by the degrees of freedom, $n-k-1$; hence we
can judge beforehand, if we know the precision of the data,
what value of k gives the best polynomial y. It is of
course possible, by taking too many terms in the polynomial
y, to fit the data too well, in the sense that the sum of
squared residuals is much smaller than that warranted by
the precision of the data.
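
The properties (4) and (7) can be checked numerically for a small n. The Python sketch below builds each polynomial from its terminal differences by Newton's forward formula; the closed form used for those differences, $(-1)^{j-s}\binom{j+s}{s}\binom{n-1-s}{j-s}$, is an assumption inferred from (5) and (6), and the function names are illustrative.

```python
from math import comb

def t_poly(j, x, n):
    """Value at integer x of the j-th orthogonal polynomial on x = 0, 1, ..., n-1.

    Uses the (assumed) terminal differences
        Delta^s t_j(0) = (-1)^(j-s) * C(j+s, s) * C(n-1-s, j-s)
    in Newton's forward formula t_j(x) = sum over s of Delta^s t_j(0) * C(x, s).
    """
    return sum((-1) ** (j - s) * comb(j + s, s) * comb(n - 1 - s, j - s) * comb(x, s)
               for s in range(j + 1))

def sum_of_squares(j, n):
    """Right-hand side of (7): n(n^2-1)(n^2-4)...(n^2-j^2) / [(2j+1)(j!)^2]."""
    numerator, factorial = n, 1
    for i in range(1, j + 1):
        numerator *= n * n - i * i
        factorial *= i
    return numerator // ((2 * j + 1) * factorial * factorial)

n, k = 6, 3
values = [[t_poly(j, x, n) for x in range(n)] for j in range(k + 1)]

# Product sums of every pair of polynomials over x = 0, ..., n-1.
gram = [[sum(p * q for p, q in zip(values[i], values[j])) for j in range(k + 1)]
        for i in range(k + 1)]
for row in gram:
    print(row)
```

The printed matrix should be diagonal, with diagonal entries equal to `sum_of_squares(j, n)`.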

64. Practical Routine of Fitting a Polynomial.

All of the above points, which can be treated only briefly
here, have been discussed at length in special memoirs.
We shall merely illustrate a method depending on the
theory of 63 and making use of a table containing the
terminal values and differences $t_j(0), \Delta t_j(0), \Delta^2 t_j(0), \dots$,
for $j = 0, 1, 2, 3, \dots, k$ and the particular value of n.
The rule for constructing such a table follows from (5)
and (6) and is simple. We shall illustrate it for n = 6.
We write down the fixed table of binomial coefficients,
table (i) below, to k+1 columns; in the illustration, k = 3.
Beside table (i) we place table (ii), consisting of binomial
coefficients of n−1, n−2, ... written below each other as
shown, also to k+1 columns. The products of corresponding
entries in the two tables now give us the desired
table (iii) of terminal values and differences of t-polynomials,
and at the feet of the respective columns we enter the
values of $\sum t_j^2$, as computed from the formula 63 (7).


            (i)                          (ii)

     1   -1    1   -1            1    5   10   10
          2   -3    4                 1    4    6
               6  -10                      1    3
                   20                           1

           (iii)                         (iv)

     1   -5   10  -10            1   -5    5   -5
          2  -12   24                 2   -6   12
               6  -30                      3  -15
                   20                          10

     6   70  336  720            6   70   84  180


A possibility making table (iii) still simpler for practical
use is that when a common integer factor is observed in
any column, we may cancel through by that factor,
provided that the square of that factor is cancelled through
from $\sum t_j^2$. Thus the cancelling of factors 2, 2 from columns
3, 4 in table (iii) above gives table (iv). Such tables,
extended to six or seven columns, are easily constructed
for a proposed value of n.

The use of the table in finding the regression coefficients
$a_j$ and the fitted values $y_x$ is best illustrated by an actual
worked example. The process is no more difficult for a
long series of data than for a short, but to economize in
space we shall illustrate it by fitting a cubic polynomial to
six values $u_x$.

Example.

    x     0     1     2     3     4     5

    u     5    13    25    60   105   200

By summation the reduced factorial moments of u are
found to be 408, 1663, 2835 and 2480, while $\sum u^2 = 55444$.

Using four columns only of the table of polynomials (since
we are fitting a cubic) we set out the rest of the work in
compact shape thus:

                                          Sums
             1      -5       5      -5     408     y_0    =  4.590
                     2      -6      12    1663     Δy_0   =  8.975
                             3     -15    2835     Δ²y_0  =  4.334
                                    10    2480     Δ³y_0  = 10.611

           408    1286     567     191
    a_j     68   18.371   6.75   1.0611
             6      70      84     180             Check y_5 = 198.91.

Explanation.

    a_0 = (408 × 1)/6 = 68.
    a_1 = (1663 × 2 − 408 × 5)/70 = 1286/70 = 18.371.
    a_2 = (2835 × 3 − 1663 × 6 + 408 × 5)/84 = 567/84 = 6.75,

and so on; the elements in columns of the table are used as
multipliers of the factorial moments, the entries at the feet of
the columns as divisors. Then

    y_0   = 68 × 1 − 18.371 × 5 + 6.75 × 5 − 1.0611 × 5 = 4.590.
    Δy_0  = 18.371 × 2 − 6.75 × 6 + 1.0611 × 12 = 8.975.
    Δ²y_0 = 6.75 × 3 − 1.0611 × 15 = 4.334,

and so on; the elements in rows are now used as multipliers
of the $a_j$, and give the terminal value $y_0$ and its differences.
There is also a good check on the other terminal value,

    y_5 = 68 × 1 + 18.371 × 5 + 6.75 × 5 + 1.0611 × 5 = 198.91,

the same terms as gave $y_0$, but with positive multipliers.

Building up a difference table of the $y_x$ from the constant
3rd differences in the way familiar in interpolation, we have:

    x       y        Δy       Δ²y      Δ³y       u      u−y

    0     4.590     8.975     4.334   10.611      5     0.410
    1    13.565    13.309    14.945   10.611     13    -0.565
    2    26.874    28.254    25.556   10.611     25    -1.874
    3    55.128    53.810    36.167              60     4.872
    4   108.938    89.977                       105    -3.938
    5   198.915                                 200     1.085


The comparison of the fitted values with the data can be
seen in the columns headed y and u. The sum of squared
residuals $(u-y)^2$ will be found to be 44.4.
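
The worked example can be reproduced in a few lines of Python. This is a sketch under the same assumption as to the terminal differences of the t-polynomials (binomial products, with common integer factors cancelled as in table (iv)); it works with the polynomial values themselves rather than with factorial moments, which gives the same numerators.

```python
from functools import reduce
from math import comb, gcd

u = [5, 13, 25, 60, 105, 200]   # data of the worked example, at x = 0, 1, ..., 5
n, k = len(u), 3                # six observations, cubic polynomial

def t_values(j):
    """Values t_j(0..n-1), built from terminal differences with common factors cancelled."""
    diffs = [(-1) ** (j - s) * comb(j + s, s) * comb(n - 1 - s, j - s) for s in range(j + 1)]
    factor = reduce(gcd, (abs(d) for d in diffs))
    return [sum((d // factor) * comb(x, s) for s, d in enumerate(diffs)) for x in range(n)]

t = [t_values(j) for j in range(k + 1)]
numerators = [sum(ux * tx for ux, tx in zip(u, t[j])) for j in range(k + 1)]
divisors = [sum(tx * tx for tx in t[j]) for j in range(k + 1)]
a = [num / div for num, div in zip(numerators, divisors)]

fitted = [sum(a[j] * t[j][x] for j in range(k + 1)) for x in range(n)]
residual_ss = sum((ux - yx) ** 2 for ux, yx in zip(u, fitted))
# The same sum from 63 (11): sum of u^2 less the products a_j * numerator_j.
shortcut_ss = sum(ux * ux for ux in u) - sum(aj * num for aj, num in zip(a, numerators))

print(numerators, divisors)
print([round(aj, 4) for aj in a])
print(round(residual_ss, 2), round(shortcut_ss, 2))
```

Because the coefficients here are not rounded, the fitted values may differ in the third decimal place from those of the difference table.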

But we can also set out a table thus, estimating by 63 (11) 
the variance of a residual after a constant, a straight line, 
a parabola and our cubic are fitted in succession : 


    k    n−k−1     a_k     num. of a_k    prod.      S²     S²/(n−k−1)

                                                   55444
    0      5      68           408        27744    27700       5540
    1      4      18.371      1286        23626     4074       1019
    2      3       6.75        567         3827      247         82
    3      2       1.0611      191          203       44         22


The column headed $S^2$ shows the sum of squared residuals,
obtained in accordance with 63 (11) by subtracting the
entries in the previous column in turn from $\sum u^2 = 55444$.
The last column gives estimates of the variance of a single
residual at the different stages. To test which polynomial
best represents the data, we must have a preliminary
knowledge or estimate of the variance of the observations.
This variance is compared with the residual variance in
the light of the sampling distributions of 71 and 74.

The alternative computation of the sum of squared residuals
as 44 checks the work, for the same sum was given by the
fitted values as 44.4.

For a given value of n the same table of terminal values
and differences of t-polynomials serves for fitting a
polynomial of any degree. Thus, using only the first three
columns of the table in the above worked example, we
may fit a parabola instead of a cubic. It will be found
instructive to do this, following the details of the worked
example. Notice that the coefficients $a_0$, $a_1$, $a_2$ are the
same as before.


Example. Fit a cubic polynomial to the seven equidistant
and equally weighted data

    x      0     1     2     3     4     5     6

    u    -11     5    13    25    60   105   200

65. Periodic Regressions : Observations of Equal 
Weight. Observations which exhibit periodicity more or 
less masked by accidental error are of common occurrence. 
The height of tide-water at a seaport, measured at equal 
intervals of time, shows such a periodicity ; monthly 
averages of temperature show a seasonal periodicity; 
telephone calls on an Exchange show a weekly periodicity. 

The procedure for analysing periodicity is to assume a
periodic function

    $y_\theta = a_0 + a_1\cos\theta + a_2\cos 2\theta + \dots + a_k\cos k\theta$
    $\qquad\quad + b_1\sin\theta + b_2\sin 2\theta + \dots + b_k\sin k\theta$   (1)

and to find the coefficients $a_j$ and $b_j$ of the constituent
periodic terms by the method of Least Squares.

We consider therefore n equally spaced observations
$u_\theta$ of equal weight, where $\theta = 0, 2\pi/n, 4\pi/n, \dots, 2(n-1)\pi/n$,
the observations thus corresponding to the n phase-angles
of one complete oscillation of a periodic phenomenon.
The initial observation of a second oscillation is not
included. In view of the trigonometrical relations

    $\sum_\theta \cos h\theta \cos j\theta = 0$, if $h \ne j$;  $= \tfrac{1}{2}n$, if $h = j$,   (2)

and the similar ones with one or both cosines replaced by
sines (these are really orthogonal relations exactly
resembling those of the Tchebychef polynomials in 63 (4)
and (7)), the sum $S^2$ of squared residuals is (cf. 63 (9))

    $\sum_\theta (u_\theta - y_\theta)^2 = \sum_\theta [u_\theta^2 - 2u_\theta(a_0 + a_1\cos\theta + \dots + b_k\sin k\theta)]$
    $\qquad + \tfrac{1}{2}n(2a_0^2 + a_1^2 + \dots + b_k^2)$.   (3)

Differentiating with respect to the $a_j$ and $b_j$ and
equating to zero, we have the normal equations for the
regression coefficients. Each is given independently of
the others.

    $a_0 = \frac{1}{n}\sum_\theta u_\theta, \quad a_h = \frac{2}{n}\sum_\theta u_\theta\cos h\theta, \quad b_h = \frac{2}{n}\sum_\theta u_\theta\sin h\theta$.   (4)

If n is even,

    $a_{n/2} = \frac{1}{n}\sum_\theta u_\theta\cos\tfrac{1}{2}n\theta, \quad b_{n/2} = 0$,   (5)

and $\cos\tfrac{1}{2}n\theta$ is +1 and −1 alternately as $\theta$ takes its n
values.

The theoretical solution is thus immediate. Simplicity
of practical application will depend on the value of n, and
the consequent values of $\cos h\theta$ and $\sin h\theta$.
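
A direct, unoptimized evaluation of the coefficients may be sketched in Python as below. The function name and the check data are hypothetical: the data are generated from a known periodic function at n = 8 phase-angles, so that the recovered coefficients can be compared with the ones put in.

```python
import math

def periodic_regression(u, k):
    """Least-squares coefficients a_0..a_k and b_1..b_k for n equally spaced data.

    a_0 is the mean; a_h and b_h are 2/n times the cosine and sine product sums,
    except that for even n the term h = n/2 takes the factor 1/n and has b = 0.
    """
    n = len(u)
    theta = [2.0 * math.pi * r / n for r in range(n)]
    a = [sum(u) / n]
    b = [0.0]            # placeholder so that b[h] pairs with a[h]
    for h in range(1, k + 1):
        last = (2 * h == n)
        scale = (1.0 if last else 2.0) / n
        a.append(scale * sum(v * math.cos(h * t) for v, t in zip(u, theta)))
        b.append(0.0 if last else scale * sum(v * math.sin(h * t) for v, t in zip(u, theta)))
    return a, b

# Hypothetical check data: 3 + 2 cos(theta) - sin(2 theta) at eight phase-angles.
n = 8
u = [3.0 + 2.0 * math.cos(2 * math.pi * r / n) - math.sin(2 * 2 * math.pi * r / n)
     for r in range(n)]
a, b = periodic_regression(u, 3)
print([round(v, 6) for v in a])
print([round(v, 6) for v in b])
```

Each coefficient is computed without reference to the others, which is the practical meaning of the orthogonal relations.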


66. Practical Solution of the Normal Equations.

The process of numerical solution becomes specially simple
when $\theta$, that is, $2\pi/n$, is such that $\cos h\theta$ and $\sin h\theta$ are
easy to handle. This occurs when n = 4, 6, 8, 12 or 24,
the last two cases being specially important, as corresponding
to the hourly or two-hourly subdivision of the
day; and special routines for these values of n have been
devised.

The procedure depends on the fact that in the four
quadrants, from $\theta = 0$ to $\theta = 2\pi$, $\cos\theta$ and $\sin\theta$ take the
same absolute values four times, though with differing
alternations of sign. To take the case n = 12 for illustration,
the data $u_\theta$ (and there will be no misunderstanding
if these are written meanwhile as $u_0, u_1, \dots, u_{11}$) can be
assembled in tetrads, for example $u_1 + u_5 - u_7 - u_{11}$, before
being multiplied by the suitable values of $\cos h\theta$ and
$\sin h\theta$, where $h\theta$ can always be taken as coterminous with
some angle in the first quadrant.

We shall indicate how this is done by an actual example.

Example. To fit terms as far as $a_4\cos 4\theta$, $b_4\sin 4\theta$ to the
12 data (Whittaker and Robinson, Calculus of Observations,
p. 272):

    u_0    u_1    u_2    u_3    u_4    u_5    u_6    u_7    u_8    u_9    u_10   u_11

    2.71   3.04   2.13   1.27   0.79   0.50   0.37   0.54   0.19  -0.35  -0.44   0.77


First write the data, in units of 0.01, in a scheme U of
columns, down, up, down, up, with blanks as indicated by
the dots, as follows:

           271     .     37     .                +   +   +   +
    U  =   304    50     54    77         M  =   +   -   -   +
           213    79     19   -44                +   -   +   -
           127     .    -35     .                +   +   -   -

Next, add along the rows of the scheme U, after giving
sign to the columns of U in four different ways, according
to the rows in the sign-scheme M. We thus obtain four
separate sets of totals, and these are combined with cosines
and sines of 0°, 30°, 60°, 90° in four separate schemes, as
below. (We have included the coefficients necessary for
computing $a_5$, $b_5$ and $a_6$ as well.)

              a_0      a_2      a_4      a_6                  a_1      a_3      a_5

     308      0.5      1        1        0.5         234      1        1        1
     485      0.5      0.5     -0.5     -0.5         277      0.866    0       -0.866
     267      0.5     -0.5     -0.5      0.5          71      0.5     -1        0.5
      92      0.5     -1        1       -0.5         162      0        0        0

    sums      576      325      24      -1          sums      509.4    163      29.6
    ÷ 6       0.960    0.542    0.040   -0.002      ÷ 6       0.849    0.272    0.049

              b_2      b_4                                    b_1      b_3      b_5

     234      0        0                             308      0        0        0
     231      0.866    0.866                         223      0.5      1        0.5
     197      0.866   -0.866                         317      0.866    0       -0.866
      92      0        0                             162      1       -1        1

    sums      370.6    29.4                         sums      548      61      -1.0
    ÷ 6       0.618    0.049                        ÷ 6       0.913    0.102   -0.002

(The quotients in the last row of each scheme are given in the
original units of the data.)

Explanation.

    a_1 = (2.34 × 1 + 2.77 × 0.866 + 0.71 × 0.5 + 1.62 × 0)/6 = 0.849, etc.

Hence, as far as terms in $\cos 4\theta$, $\sin 4\theta$, the regression is

    $y = 0.960 + 0.849\cos\theta + 0.542\cos 2\theta + 0.272\cos 3\theta + 0.040\cos 4\theta$
    $\qquad\quad + 0.913\sin\theta + 0.618\sin 2\theta + 0.102\sin 3\theta + 0.049\sin 4\theta$,

and the regressions to fewer or more terms involve the
same coefficients $a_j$, $b_j$ as are given by the above scheme
of solution.

The sum of squared residuals may also be calculated
beforehand from the regression coefficients in a scheme set
out as follows:

                      n a_0² and
    k    n−2k−1     ½n(a_k² + b_k²)       S²       ÷ (n−2k−1)

                                        24.983
    0      11           11.059          13.924       1.266
    1       9            9.326           4.598       0.511
    2       7            4.054           0.544       0.078
    3       5            0.506           0.038       0.008
    4       3            0.024           0.014       0.005

Just as in polynomial regression, the contributions to the
sum of squared residuals produced by successive terms are
subtracted in turn from $\sum u^2$, which here is 24.983. The estimate
of variance of a single residual is then made by dividing the
residual sum of squares by $n - 2k - 1$, the number of degrees
of freedom. The results are shown in the last column.
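
As a check on the scheme, the coefficients and the table of residual sums of squares can be recomputed directly from the twelve data. The Python sketch below uses the defining sums for the coefficients and not the tetrad routine; since it carries unrounded coefficients, its figures may differ from the tabulated ones by a few units in the third decimal place.

```python
import math

# The twelve data of the example, u_0 ... u_11.
u = [2.71, 3.04, 2.13, 1.27, 0.79, 0.50, 0.37, 0.54, 0.19, -0.35, -0.44, 0.77]
n, k = len(u), 4
theta = [2.0 * math.pi * r / n for r in range(n)]

a = [sum(u) / n]
b = [0.0]                      # placeholder so that b[h] pairs with a[h]
for h in range(1, k + 1):
    a.append(2.0 / n * sum(v * math.cos(h * t) for v, t in zip(u, theta)))
    b.append(2.0 / n * sum(v * math.sin(h * t) for v, t in zip(u, theta)))

# Residual sums of squares, removing the constant and then one harmonic at a time.
s2 = sum(v * v for v in u) - n * a[0] ** 2
table = [(0, n - 1, s2, s2 / (n - 1))]
for h in range(1, k + 1):
    s2 -= 0.5 * n * (a[h] ** 2 + b[h] ** 2)
    table.append((h, n - 2 * h - 1, s2, s2 / (n - 2 * h - 1)))

print([round(v, 3) for v in a])
print([round(v, 3) for v in b[1:]])
for row in table:
    print("k=%d  d.f.=%2d  S2=%.3f  variance=%.3f" % row)
```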

67. General Regressions. After what has preceded,
the routine to be adopted in other regressions, such as

    $y = a_0 + a_1\tan\theta + a_2\tan 2\theta + \dots + a_k\tan k\theta$   (1)

will be readily understood. Such regressions are not
common in statistical work, but they are not outside the
bounds of possibility. The desirable thing in any problem
of regression will be to express y, if possible, in terms of
functions which, like the Tchebychef polynomials or the
sine and cosine of multiples of $2\pi/n$, have the orthogonal
property that the product-sums of different functions of
the set over the range vanish. The effective meaning of
this is that the contributions of the successive terms to
the regression are uncorrelated with each other.

Harmonic Analysis. For fuller details concerning the
practical routine of estimating periodic regressions, the
reader may consult the chapters on harmonic analysis in
Whittaker and Robinson's Calculus of Observations, or
Brunt's Combination of Observations, 2nd edition, 1931.



CHAPTER VII 


PROBABILITY DISTRIBUTIONS OF STATISTICAL 
COEFFICIENTS 

68. Sampling Distributions. A statistical coefficient
computed from a sample of n values, univariate or
multivariate, is only an estimate of the corresponding parameter
in the population or underlying probability function. It
is therefore to be presumed erroneous, though the degree 
of error cannot be affirmed exactly, since the true value of 
the parameter is not known. The degree of error can be 
stated only in terms of probability ; and the probability 
distributions involved are (i) the hypothetical population, 
or distribution of the variate or variates, (ii) the derived 
distribution of the coefficient of estimate from sample. 
The second of these is called the sampling distribution of 
the coefficient. 

Let us consider a case in which the first of these 
distributions, the probability distribution of the variate, 
is not hypothetical but given. In Charlier’s experiment 
(22) of drawing 10 cards from a pack, with replacement 
of each card, and continuing this until a sample of 1000 
sets of 10 cards had been collected, the variate was the 
number x of black cards in a set of 10, and its probability 
distribution was the binomial distribution, with mean 5
and variance 2.5; the corresponding values of mean and
mean square deviation in Charlier's sample were 4.933
and 2.415. Are the respective deviations $4.933 - 5$, or
$-0.067$, and $2.415 - 2.5$, or $-0.085$, reasonable or
abnormal? Such questions can be answered only when
the sampling distributions of the estimates of mean and
variance are known.

The nature and genesis of these sampling distributions
can be illustrated from this same example. The sample
group of 1000 sets of 10 card drawings was merely one
out of an enormous number of equally possible groups.
From the pack of 52 cards the 10 cards, drawn one at
a time with replacement, could eventuate, if order of
drawing were taken into account, in $52^{10}$ ways. This is
an unimaginably large number, but the number of groups
of 1000 sets which may be chosen from these $52^{10}$ sets
is incomparably greater still. Each group may be
supposed to have its mean m and mean square deviation
$s^2$, computable in the usual way. The aggregates of these
values of m and $s^2$ constitute probability distributions,
and these are the sampling distributions of m and $s^2$ for
the kind of sample in question.

Example. If the parent population is normal and the
number in sample is n, the sampling variances of the
estimates $m_2$, $m_3$, $m_4$ of the moments $\mu_2$, $\mu_3$, $\mu_4$ are respectively
$2\sigma^4/n$, $6\sigma^6/n$, $96\sigma^8/n$. For ... they increase rapidly.

The functional form of a sampling distribution depends 
(i) on the population (probability function of the variate 
or variates sampled), (ii) on the function used for estimating 
the parameter, and (iii) on n , the number of observations 
in the sample. Since 1900, and especially since 1915, 
much research has been expended on the problem of 
deriving the probability distributions of the commoner 
coefficients. Most of this research has been devoted to 
samples of a normally distributed variate or variates, and 
the sampling distributions are now well known and already 
classic. It appears that as the number n in sample 
increases the sampling distributions of many coefficients, 
though by no means of all, tend themselves towards the 
normal type. In such cases it is customary to supply an
estimate of the precision of a coefficient by appending to
its computed value its standard deviation of sampling, or
standard error; and this is sometimes said to imply a
probability of 19/20 that the true value lies in the range
delimited by twice (more precisely, 1.96 times) the standard
error on either side of the computed value.

The form of statement is not strictly accurate; if, for
example, a computed mean is m, and the central 95 per cent.
range of sampling probability area is the criterion of what is
acceptable, then m may be anywhere from the extreme left
of the 95 per cent. range of a sampling distribution centred
on a hypothetical mean $\mu'$ to the extreme right of the 95 per
cent. range of a second sampling distribution centred on a
hypothetical mean $\mu''$; but these are different distributions,
and it does not follow either that the left half-range $\mu' - m$
of the first is equal to the right half-range $m - \mu''$ of the
second, or that we can add the probabilities, under the different
hypotheses, that the true $\mu$ lies in these respective half-ranges.
An illuminating discussion of the problem is given in a paper
by Clopper and E. S. Pearson, Biometrika 26 (1934), p. 404.

When the number in sample n is small the sampling 
distribution of the coefficient is often of non-normal, skew 
or platykurtic type, and the standard error is an insufficient 
indication of the interval within which the true value of 
the parameter may lie. It is necessary in such a case to 
know the sampling distribution and probability integral 
of the special coefficient. 


69. The Sampling Distribution of Means. In a
few cases the sampling probability function of the mean
of n observations is of the same type as the probability
function of the population. For example, the normal
probability function with mean $\mu$ and variance $\sigma^2$,

    $\phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu)^2/\sigma^2}$,   (1)

has m.g.f. $\exp(\mu\alpha + \tfrac{1}{2}\sigma^2\alpha^2)$. Hence the m.g.f. of the sum
of n sample values $x_i$ is $\exp(n\mu\alpha + \tfrac{1}{2}n\sigma^2\alpha^2)$. To change
from sum to mean is to write $x/n$ for x, or $\alpha/n$ for $\alpha$. Hence
the m.g.f. of the mean of sample is $\exp(\mu\alpha + \tfrac{1}{2}\sigma^2\alpha^2/n)$.
The mean of sample is thus distributed normally about
the same mean as before, but with variance $\sigma^2/n$, or
standard error $\sigma/\sqrt{n}$.

Ex. 1. Prove that if $\kappa_1, \kappa_2, \kappa_3, \dots$ are the cumulants
of any population, the cumulants of the mean of a sample
of n are $\kappa_1, \kappa_2/n, \kappa_3/n^2, \dots$.

Ex. 2. The number x of black cards in a set of 10 in
Charlier's experiment is binomially distributed with mean 5
and variance 2.5. The mean of x in 1000 sets is distributed
with approximate normality, about mean 5, and with variance
2.5/1000, or 0.0025. The standard error is thus 0.05. The
deviation of the mean 4.933 of Charlier's sample from 5 is
$-0.067$, about 4/3 of the standard error.

The deviation is not excessive. From the table of the
normal probability integral on p. 144 it is seen that the
probability of a deviation exceeding $1.34\sigma$ is about 0.18.
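
The figures of Ex. 2 can be verified with the complementary error function, which gives the two-sided normal tail area directly. A minimal Python sketch (the variable names are illustrative):

```python
import math

mu, variance, sets = 5.0, 2.5, 1000   # binomial mean and variance; number of sets
sample_mean = 4.933                   # Charlier's observed mean

standard_error = math.sqrt(variance / sets)
z = (sample_mean - mu) / standard_error
# Probability of a standardized normal deviation numerically exceeding |z|.
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))

print(round(standard_error, 4), round(z, 2), round(p_two_sided, 3))
```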

Again, if the probability function of x is of Gamma or
so-called $\chi^2$ type, namely

    $\phi(x) = (\Gamma(k))^{-1} x^{k-1} e^{-x}$,   (2)

the m.g.f. is

    $(\Gamma(k))^{-1}\int_0^\infty x^{k-1} e^{-(1-\alpha)x}\,dx = (1-\alpha)^{-k}$.   (3)

The m.g.f. of the sum of n sample values $x_j$ is $(1-\alpha)^{-nk}$,
and so the m.g.f. of m, the sample mean, is $(1-\alpha/n)^{-nk}$.
Reverting to the probability function, which by a theorem
of Lerch is unique, we obtain the probability function
of m as

    $\phi(m) = n(\Gamma(nk))^{-1}(nm)^{nk-1}e^{-nm}$.   (4)

This is again of Pearson's Type III.

Ex. 3. Prove that the distribution of the sum (not the
mean) of n values $x_i$ each obeying the Poissonian law $\psi(x)$ of
33 is Poissonian. (Use the f.m.g.f. of x.)

70. Distribution of Mean Square in Normal
Sample. If the probability differential of x is

    $dp = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\,dx$,   (1)

that of $z = \tfrac{1}{2}x^2$, as in 37 (2), is

    $dp = \frac{1}{\sqrt{\pi}}\, z^{-\frac{1}{2}} e^{-z}\,dz$,   (2)

and so the m.g.f. of z is $(1-\alpha)^{-\frac{1}{2}}$, by 69 (3). Hence the
m.g.f. of half the sum of the squares of n sample values $x_j$
is $(1-\alpha)^{-\frac{1}{2}n}$, and so if $s^2$ is the mean of the squares the
m.g.f. of $\tfrac{1}{2}s^2$ is $(1-\alpha/n)^{-\frac{1}{2}n}$. It follows that the probability
differential of u, where $u = \tfrac{1}{2}s^2$, is

    $dp = [\Gamma(\tfrac{1}{2}n)]^{-1} n^{\frac{1}{2}n} u^{\frac{1}{2}n-1} e^{-nu}\,du$,   (3)

again of $\chi^2$ type. By changing from $\tfrac{1}{2}s^2$ to $s^2$ we have
the probability function of $s^2$, namely

    $\phi(s^2) = 2^{-\frac{1}{2}n}[\Gamma(\tfrac{1}{2}n)]^{-1} n^{\frac{1}{2}n}(s^2)^{\frac{1}{2}(n-2)} e^{-\frac{1}{2}ns^2}$.   (4)

In unstandardized units we must write $s^2/\sigma^2$ for $s^2$ on
the right of (4), and insert the factor $1/\sigma^2$.

The seminvariant g.f. of $s^2$ is

    $-\tfrac{1}{2}n\log(1-2\alpha/n) = \tfrac{1}{2}n(2\alpha/n + 4\alpha^2/2n^2 + \dots)$
    $\qquad = \alpha + 2\alpha^2/2!\,n + \dots$.   (5)

Thus the mean of $s^2$ is 1 and the variance is $2/n$; in
unstandardized units these are $\sigma^2$ and $2\sigma^4/n$, where $\sigma^2$
is the variance of x. The s.g.f. also shows that as n
increases the m.g.f. of $s^2$ tends to asymptotic equivalence
with $\exp(\alpha + \alpha^2/n)$; and so the distribution of $s^2$ tends
to normality.


Example. The distribution of $s^2$ in Charlier's 1000 sets is
almost normal; $\sigma^2 = 2.5$, and $s^2$ computed from the sample
(using deviations not from Charlier's mean m = 4.933, but
from $\mu = 5$) is 2.419. The standard error of $s^2$ is $\sigma^2\sqrt{2/n}$
$= 2.5/\sqrt{500} = 0.112$. The actual deviation, $-0.081$, is
numerically about three quarters of this.
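
The same check can be made for the mean square as was made for the mean. The Python sketch below takes the normal approximation for granted, as the example does, and reports the two-sided tail area for the observed deviation; the variable names are illustrative.

```python
import math

sigma2 = 2.5      # population variance of the number of black cards
n = 1000          # number of sets in Charlier's sample
s2 = 2.419        # mean square deviation about the true mean 5

standard_error = sigma2 * math.sqrt(2.0 / n)
deviation = s2 - sigma2
ratio = deviation / standard_error
# Two-sided normal tail area for a deviation of this size.
p_two_sided = math.erfc(abs(ratio) / math.sqrt(2.0))

print(round(standard_error, 3), round(deviation, 3), round(ratio, 2), round(p_two_sided, 2))
```

A tail area of nearly one half confirms that the deviation is in no way remarkable.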


71. Distribution of Estimate of Variance. The
variance or second moment of the population is commonly
estimated from the sample by taking the n-th part of the
sum of squared deviations of the sample values $x_i$ from
the sample mean m.

Of the n deviations from m only $n-1$ are independent,
and this estimate of $\sigma^2$, which we shall call $s^2$ though
pointing out that it is not the same $s^2$ as in 70, can be
expressed as a quadratic expression in $n-1$ independent
values. Thus we have, by 14 (5),

    $s^2 = (x_1^2 + x_2^2 + \dots + x_n^2)/n - (x_1 + x_2 + \dots + x_n)^2/n^2$
    $\quad = (n-1)(z_1^2 + z_2^2 + \dots + z_{n-1}^2)/n^2$
    $\qquad - 2(z_1 z_2 + z_1 z_3 + \dots + z_{n-2} z_{n-1})/n^2$,   (1)

where $z_1 = x_1 - x_n$, $z_2 = x_2 - x_n$, ..., $z_{n-1} = x_{n-1} - x_n$; and
this is but one of many ways in which $s^2$ may be expressed
in terms of only $n-1$ variables. The $z_i$ here are linearly
independent, though correlated by possessing the term
$-x_n$ in common. (See Appendix, 5.)

This loss of a degree of freedom, for that is what it is,
complicates the problem of finding the distribution of $s^2$,
but its m.g.f. can be evaluated as a multiple integral over
the n sample values, and proves to be $(1-2\alpha/n)^{-\frac{1}{2}(n-1)}$,
which differs from that of the $s^2$ in 70 only in the exponent,
$n-1$ replacing n. It follows that the distribution of $s^2$ is
again of $\chi^2$ type, its probability function being in fact

    $\phi(s^2) = 2^{-\frac{1}{2}(n-1)}[\Gamma(\tfrac{1}{2}(n-1))]^{-1} n^{\frac{1}{2}(n-1)}(s^2)^{\frac{1}{2}(n-3)} e^{-\frac{1}{2}ns^2}$,   (2)

which should be compared with that of 70 (4).

This distribution is called Helmert's distribution, after
the German astronomer and geodetist F. R. Helmert, who
published it in 1876.

By expanding the m.g.f. and noting the coefficient of
$\alpha$ we find that the mean value of $s^2$ over all samples of
n is $(n-1)\sigma^2/n$, where $\sigma^2$ is the variance of x. This
is really the theorem of mean square residual of 59 (5),
and it is true not merely for normal but for general populations.
Because of the factor $(n-1)/n$ the precept is often
given to estimate variance by dividing the sum of squared
deviations from m not by n but by $n-1$. On the other
hand, the discrepancy in $s^2$ caused by not doing this is
of order $1/n$, whereas the standard error of sampling of $s^2$
is of order $\sqrt{2/n}$. Thus the error of method is to the
error of sampling in approximate ratio $1 : \sqrt{2n}$, which
even for n as small as 25 is less than 1/7. To insist on
the divisor $n-1$ rather than n in large samples may
therefore seem a little pedantic; but in small samples
an appreciable difference is made. One advantage of the
division by $n-1$ is this, that with the modified $s^2$ the
probability function (2) assumes the form

    $\phi(s^2) = 2^{-\frac{1}{2}(n-1)}[\Gamma(\tfrac{1}{2}(n-1))]^{-1}(n-1)^{\frac{1}{2}(n-1)}(s^2)^{\frac{1}{2}(n-3)} e^{-\frac{1}{2}(n-1)s^2}$,   (3)

which is now of exactly the same form as in 70 (4), with
$n-1$ for n throughout. Thus the loss of a degree of
freedom is made apparent. In unstandardized units we
must write $s^2/\sigma^2$ for $s^2$ and insert on the right of (3) the
factor $1/\sigma^2$.

The m.g.f. of $s^2$ in (3) is $[1 - 2\alpha/(n-1)]^{-\frac{1}{2}(n-1)}$, from
which it follows, as in 70 (5), that the sampling variance
of this modified $s^2$ is $2\sigma^4/(n-1)$, the standard error thus
being $\sigma^2\sqrt{2}/\sqrt{n-1}$.
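
That the mean value of the estimate with divisor n is $(n-1)\sigma^2/n$ for a general population, and not only for a normal one, can be illustrated by complete enumeration. The Python sketch below uses a small, purely hypothetical population of five values, forms every ordered sample of three drawn with replacement, and averages the two estimates in exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

population = [0, 1, 1, 3, 7]      # hypothetical, deliberately non-normal
n = 3                             # number in sample

size = len(population)
mu = Fraction(sum(population), size)
sigma2 = sum((Fraction(v) - mu) ** 2 for v in population) / size

def mean_square_deviation(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - m) ** 2 for x in sample) / divisor

# All equally likely ordered samples of n values drawn with replacement.
samples = list(product(population, repeat=n))
mean_with_n = sum(mean_square_deviation(s, n) for s in samples) / len(samples)
mean_with_n_minus_1 = sum(mean_square_deviation(s, n - 1) for s in samples) / len(samples)

print(sigma2, mean_with_n, mean_with_n_minus_1)
```

The first average falls short of $\sigma^2$ by exactly the factor $(n-1)/n$; the second reproduces $\sigma^2$.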

Example. By considering the coefficients of $\alpha^3/3!$ and
$\alpha^4/4!$ in the s.g.f. investigate the skewness and excess of the
distribution of $s^2$.

72. "Student's Ratio" t and its Distribution. We
have seen in 69 that the mean m of a sample of n values
$x_i$ drawn from a normal population of mean $\mu$ and variance
$\sigma^2$ is distributed normally with mean $\mu$ and variance $\sigma^2/n$.
It follows that the standardized deviation $(m-\mu)\sqrt{n}/\sigma$ is
distributed normally with mean zero and variance 1.
Now in practice we do not know $\sigma^2$ and so we cannot
standardize the scale. All that we know is the estimate
(taking $n-1$ for divisor) $s^2 = \sum (x_j - m)^2/(n-1)$. The
deviation of the mean of sample from true mean, standardized
by this estimate $s^2$, is thus $(m-\mu)\sqrt{n}/s = t$. This
is "Student's Ratio," and it is not normally distributed.

"Student" was the pen-name under which W. S. Gosset
(1876-1937) wrote his statistical papers. He discovered the
distribution in 1908.

To simplify the distribution we may place the origin
of x at $x = \mu$, thus putting $\mu = 0$. Then $m\sqrt{n}/s = t$.
Since $m\sqrt{n}/s = (m\sqrt{n}/\sigma)/(s/\sigma)$, and since the distributions
of $m\sqrt{n}/\sigma$ and $s^2/\sigma^2$ are independent of $\sigma$, we may use
standard scale with $\sigma = 1$.

For constant $s^2$ we have $dm = s\,dt/\sqrt{n}$; also the
probability of obtaining the value t is the probability that
m takes the value $st/\sqrt{n}$, and the probability differential
for this is

    $c\,e^{-\frac{1}{2}nm^2}\,dm = c\,n^{-\frac{1}{2}}\,s\,e^{-\frac{1}{2}s^2t^2}\,dt$.   (1)

This is for constant $s^2$; and so the probability differential
of t is the integral of (1) over all values of $s^2$. Hence,
multiplying (1) by the probability differential of $s^2$, which
we already know from 71 (3), and integrating from 0 to $\infty$,
we have

    $dp(t) = c_1\,dt \int_0^\infty s\,e^{-\frac{1}{2}s^2t^2}\,s^{n-3}\,e^{-\frac{1}{2}(n-1)s^2}\,ds^2$
    $\qquad = c_2\,[1 + t^2/(n-1)]^{-\frac{1}{2}n}\,dt$,   (2)

where

    $c_2 = \Gamma(\tfrac{1}{2}n) \,/\, [\sqrt{\pi(n-1)}\;\Gamma(\tfrac{1}{2}(n-1))]$,   (3)

the constant $c_2$ being fixed, as always, by the condition
that the total probability is 1.

Note . The above derivation is the one usually given, 
but an important remark must be made. The essential step, 
the compounding of the probability differentials of m and s*, 
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presumes the statistical independence of m and «*. This 
independence (Appendix, 6 ) is not evident, nor is it 
capable of quite elementary proof. The reader may assume, 
however, both here and in the case of the difference of means 
of two samples, that the numerator and denominator of t 
are independent. 

The remarkable and important fact about the 
^-distribution is that it does not involve the unknown a 2 , 
a partial reason being that t is a ratio , of zero dimension 
in or 2 . The discovery of the distribution in 1908 had a 
profound influence on “ small sample *’ theory; for 
whereas it had long been conventional to take s as the 
presumptive o and to estimate the probable region of the 
unknown (l by regarding (m-/x)Vn/5 as a standardized 
normal variate, this was now seen to be an inexact 
procedure, and the ^-distribution was used instead. 

Since $[1+t^2/(n-1)]^{-\frac{1}{2}n}$ tends with increasing $n$ to
$\exp(-\frac{1}{2}t^2)$, it is apparent that for large samples the
$t$-distribution tends to the standard normal one; but the
tendency is not rapid, and for small values of $n$, as one
might suspect from noting that $n = 2$ gives the Cauchy
distribution, the departure from normality is marked, the
curves being leptokurtic, that is, heavier in the tails than
the normal curve. For example, whereas in the
normal curve 0.95 of the area is contained in the range
$x = -1.96$ to $x = 1.96$, in the $t$-curve for $n = 10$ the same
area lies between $t = -2.26$ and $t = 2.26$; and for area
0.99 the ranges are given by $x = \pm 2.58$ and $t = \pm 3.25$.
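
The ranges just quoted can be checked by direct numerical integration of the density (2). The following is only a sketch and not part of the original derivation: the names `t_density` and `central_area` are illustrative, `n` denotes the sample size (so that there are n - 1 degrees of freedom), and Simpson's rule is an assumed choice of quadrature.

```python
import math

def t_density(t, n):
    """Density (2) of section 72 for a sample of size n (n - 1 degrees of freedom)."""
    c2 = math.gamma(n / 2) / (math.sqrt((n - 1) * math.pi) * math.gamma((n - 1) / 2))
    return c2 * (1 + t * t / (n - 1)) ** (-n / 2)

def central_area(limit, n, steps=2000):
    """Area of the t-curve between -limit and +limit, by Simpson's rule.

    steps must be an even number of sub-intervals.
    """
    h = 2 * limit / steps
    total = t_density(-limit, n) + t_density(limit, n)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(-limit + k * h, n)
    return total * h / 3

print(round(central_area(2.262, 10), 3))   # area within t = +/-2.262 for n = 10
print(round(central_area(3.250, 10), 3))   # area within t = +/-3.250 for n = 10
```

With n = 2 the same function reduces to the Cauchy density, which gives a simple independent check of the constant (3).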

A table of the probability integral of the $t$-distribution,
in a form useful for practical application, is given in R. A.
Fisher’s Statistical Methods for Research Workers, 8th edition,
p. 167. His $n$ is our $n-1$, the number of degrees of freedom.

Example. A coin, thrown 20 times on each of 10 occasions,
shows 7, 9, 6, 10, 13, 6, 9, 7, 10, 7 heads respectively.
Assuming the binomial distribution of 20 throws to be
approximately normal, consider whether the coin is biassed.

The mean of the heads thrown is $m = 8.4$ and $s^2 = 4.93$,
$s = 2.22$. Thus, presuming an unbiassed $\mu = 10$, we have
$t = (10-8.4)\sqrt{10}/2.22 = 2.28$. From Fisher’s tables, in the
row $n = 9$ (our $n-1$) we find $t = 2.262$ at $P = 0.05$. ($P$ is
the probability of a $t$ numerically greater than 2.262.) Thus
the coin-throws leave it rather doubtful whether the coin is
biassed or not.

A reading of tables of the normal probability integral for
$x = 2.28$ would have given $P = 0.023$, with an unjustifiably
stronger suggestion of bias in the coin.
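
The arithmetic of this Example may be retraced as follows. The sketch uses only the data given above; the variable names are illustrative, and the normal tail area is obtained from the complementary error function rather than from tables, which is an assumption of convenience.

```python
import math

heads = [7, 9, 6, 10, 13, 6, 9, 7, 10, 7]          # data of the Example
n = len(heads)
m = sum(heads) / n                                  # sample mean
s2 = sum((x - m) ** 2 for x in heads) / (n - 1)     # estimate of variance
s = math.sqrt(s2)
mu = 10                                             # unbiassed coin: 20 throws, p = 1/2
t = (mu - m) * math.sqrt(n) / s

# Two-sided tail area if t were wrongly treated as a standardized normal variate.
p_normal = math.erfc(t / math.sqrt(2))

print(round(m, 1), round(s2, 2), round(s, 2), round(t, 2), round(p_normal, 3))
```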

73. Difference of Means of Two Normal Samples.

A valuable use of the $t$-distribution is in testing the
hypothesis that two samples, with different numbers $n$
and $N$ in sample, are from the same normal population,
of mean $\mu = 0$ and variance $\sigma^2$.

Let $x_1, x_2, ..., x_n$ be the first sample, $X_1, X_2, ..., X_N$
be the second, with respective means $m$ and $M$, and
estimates of variance

$$s^2 = \sum_i (x_i-m)^2/(n-1), \qquad S^2 = \sum_j (X_j-M)^2/(N-1). \qquad (1)$$

The basis of the test is the difference $m-M$, the
variance of which (15) is

$$\sigma^2/n+\sigma^2/N = (n+N)\sigma^2/nN. \qquad (2)$$

The estimates $s^2$ and $S^2$ of $\sigma^2$ are (71) of weights $n-1$
and $N-1$, and so yield a combined estimate of $\sigma^2$, namely

$$\bar{s}^2 = [(n-1)s^2+(N-1)S^2]/(n+N-2)$$

$$= [\sum_i (x_i-m)^2+\sum_j (X_j-M)^2]/(n+N-2). \qquad (3)$$

It can be proved that $m-M$ and $\bar{s}^2$ are statistically
independent. We therefore define, from (2) and (3),

$$t = \frac{m-M}{\bar{s}} \left(\frac{nN}{n+N}\right)^{\frac{1}{2}}, \qquad (4)$$

and it now follows, by the argument of 72, that this $t$
has the $t$-distribution, but with $n+N-2$, the number of
degrees of freedom used in estimating $\sigma^2$, in place of the
former $n-1$. Thus the $t$-tables may be consulted for the
probability $P(t)$ that $t$ numerically exceeds any assigned
value. (The examples of p. 109 are amenable to $t$-tests.)

The important point is the way in which $\sigma^2$ is estimated.
One might have pooled both samples and estimated $\sigma^2$ from
the $n+N$ squared deviations about the pooled mean
$(nm+NM)/(n+N)$, summed and divided by $n+N-1$. This
slightly more accurate estimate of $\sigma^2$ is, however, not
independent of $m-M$.
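
A minimal sketch of the computation in (1), (3) and (4) follows. The function name `pooled_t` and both samples are hypothetical, chosen only for illustration; the resulting $t$ would still have to be referred to a $t$-table with the stated degrees of freedom.

```python
import math

def pooled_t(x, X):
    """Return (t, degrees of freedom) for two samples, following (1), (3), (4) of 73."""
    n, N = len(x), len(X)
    m, M = sum(x) / n, sum(X) / N
    ss_x = sum((v - m) ** 2 for v in x)        # sum of squares about m
    ss_X = sum((v - M) ** 2 for v in X)        # sum of squares about M
    pooled = (ss_x + ss_X) / (n + N - 2)       # combined estimate (3)
    t = (m - M) / math.sqrt(pooled) * math.sqrt(n * N / (n + N))   # definition (4)
    return t, n + N - 2

# Hypothetical data, for illustration only.
first = [5.1, 4.9, 5.6, 5.8, 6.0]
second = [4.1, 4.6, 5.0, 4.4, 4.9, 4.0]
t, dof = pooled_t(first, second)
print(round(t, 3), dof)
```

Note that the divisor is n + N - 2, not n + N - 1: each sample is measured from its own mean, which is what keeps the estimate independent of m - M.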

74. The Ratio of Two Variates of the Same
Type. The two samples in 73 give in general different
estimates $s^2$ and $S^2$ of the variance $\sigma^2$. If the question
is whether both samples are from the same normal population,
we shall wish to test this by means of $s^2$ and $S^2$,
without reference to the unknown $\sigma^2$. The analogy of
$t$ suggests the ratio $u = s^2/S^2$. Since $u$ is unaltered when
we write $s^2/\sigma^2$ for $s^2$, and $S^2/\sigma^2$ for $S^2$, we may work in
standard scale, using the distribution of the variance
estimate found in 71. Let us
write $v = s^2/\sigma^2$, $V = S^2/\sigma^2$. Then $u = v/V$, or $v = uV$.

By 71 the probability differentials of $v$ and $V$ are

$$c_1 v^{\frac{1}{2}(n-3)} e^{-\frac{1}{2}(n-1)v}\,dv \quad \text{and} \quad C_1 V^{\frac{1}{2}(N-3)} e^{-\frac{1}{2}(N-1)V}\,dV. \qquad (1)$$

For fixed $V$ we have $dv = V\,du$; so, integrating for all $V$,
we have the probability differential of $u$,

$$c_1 C_1\,du \int_0^\infty V (uV)^{\frac{1}{2}(n-3)} e^{-\frac{1}{2}(n-1)uV}\, V^{\frac{1}{2}(N-3)} e^{-\frac{1}{2}(N-1)V}\,dV$$

$$= c_1 C_1\,u^{\frac{1}{2}(n-3)}\,du \int_0^\infty V^{\frac{1}{2}(n+N-4)} e^{-\frac{1}{2}[N-1+(n-1)u]V}\,dV$$

$$= c\,u^{\frac{1}{2}(n-3)}\,du/[N-1+(n-1)u]^{\frac{1}{2}(n+N-2)}, \qquad (2)$$

where $c$ is fixed by making the integral over $u$ unity.

The distribution of $u$ is thus given by (2). It is
interesting to verify that as $N \to \infty$ the distribution tends
to the $\chi^2$ type, while if $n = 2$ we have a $t^2$ distribution.

The z-Distribution of Fisher. R. A. Fisher, in
testing the difference of two estimates $s^2$ and $S^2$, uses
not this ratio $u$ but half its natural logarithm. If we put

$$z = \tfrac{1}{2}\log_e u, \qquad u = e^{2z}, \qquad du = 2e^{2z}\,dz,$$

the probability differential (2) becomes

$$dp = c_2\,e^{(n-1)z}\,dz/[N-1+(n-1)e^{2z}]^{\frac{1}{2}(n+N-2)}, \qquad (3)$$

where $c_2$ is such that the total integral over $z$ is unity. The
distribution thus obtained is Fisher’s $z$-distribution.

Tables of $P(z)$, the probability of a $z$ greater than an
assigned value, are given in Fisher’s Statistical Methods for
Research Workers. In these tables the numerator of $u$ is
the greater of $s^2$ and $S^2$, so that $z$ is positive; and the
functions tabled are the values of $z$ for assigned $n$ and $N$,
such that $P = 0.05$, 0.01 and 0.001 respectively. The table
for $P = 0.001$ is due to C. G. Colcord and L. S. Deming.
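
The computation of $z$ from two estimates of variance may be sketched as below. The function name `fisher_z` is illustrative; the two values used are the mean squares for rows and for residual that appear in the worked example of 77, borrowed here simply as sample inputs.

```python
import math

def fisher_z(s2, S2):
    """Half the natural logarithm of the ratio of two variance estimates,
    the greater estimate being taken as numerator so that z is not negative."""
    greater, lesser = max(s2, S2), min(s2, S2)
    return 0.5 * math.log(greater / lesser)

z = fisher_z(2.14, 2.97)
print(round(z, 4))

# The same quantity from common logarithms, since (1/2) log_e u = 1.1513 log_10 u.
print(round(1.1513 * math.log10(2.97 / 2.14), 4))
```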

75. Analysis of Variance and of Sum of Squares. 

The basic idea of the experimental designs introduced by 
R. A. Fisher, and of the accompanying technique called 
analysis of variance , is that of dividing up a total sum of 
squared deviations of a variate from its sample mean into 
several distinct sums of squares, each corresponding to a 
source, real or suspected, of variation. These partial sums 
yield estimates of the variance from each source, and the 
z-test is applied to ascertain whether these estimates are 
compatible with each other and with the estimate of residual 
variance. If they are not so compatible, it is presumed 
that the sources have distinct effects, which are further 
analysed, for example by difference (73) of means. 

The resolution into sums of squares is founded on the
Lemma, noted in 52 in connexion with the correlation
ratio, that if $k$ sets of $n_1, n_2, ..., n_k$ observations, with
respective means $M_i$ and mean square deviations $S_i^2$, are
pooled in an aggregate of $n = n_1+n_2+...+n_k$ observations,
with mean $M$ and mean square deviation $S^2$, then

$$nS^2 = \sum_i n_i(S_i^2+c_i^2), \qquad (1)$$

where $c_i = M_i-M$.
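
The Lemma (1) is easily verified on a small numerical case. In the sketch below the three sets of observations are hypothetical, and the helper names `mean` and `mean_square_deviation` are illustrative; the mean square deviation is taken with the number of observations as divisor, as the Lemma requires.

```python
def mean(values):
    return sum(values) / len(values)

def mean_square_deviation(values):
    """Mean square deviation about the mean (divisor: number of observations)."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / len(values)

# Hypothetical sets of observations, for illustration only.
sets = [[3.0, 5.0, 4.0], [6.0, 8.0], [1.0, 2.0, 2.0, 3.0]]
pooled = [v for group in sets for v in group]

M = mean(pooled)
lhs = len(pooled) * mean_square_deviation(pooled)                  # n S^2
rhs = sum(len(g) * (mean_square_deviation(g) + (mean(g) - M) ** 2)
          for g in sets)                                           # sum of n_i (S_i^2 + c_i^2)
print(abs(lhs - rhs) < 1e-9)
```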

For illustration we shall consider an experiment based
on repeated trials and designed to ascertain (i) whether $h$
varieties of a cereal are different in crop yield, (ii) whether
$k$ kinds of fertilizing treatment are different in their effect
on the crop yield of the $h$ varieties.

Consider first the case (i), the experiment on varieties
alone. Suppose each of them planted in $k$ similar plots,
assigned in random positions in a field, and subjected to
uniform cultivation. The $hk$ yields $y_{ij}$, where $i$ refers to
variety, $j$ to plot-number, may be arranged for analysis
in a rectangular scheme of $h$ rows and $k$ columns, a row
to each variety. For convenience in the algebra let us
choose the origin of $y_{ij}$ so that the sum or mean of all $y_{ij}$
is zero.

Now consider the sum $\sum_i\sum_j y_{ij}^2$ over all $hk$ deviations.
Let the means of rows (varieties) be $y_{10}, y_{20}, ..., y_{h0}$.
Then by (1), remembering that the general mean is zero,
we have

$$\sum_i\sum_j y_{ij}^2 = \sum_i\sum_j (y_{ij}-y_{i0})^2 + k\sum_i y_{i0}^2. \qquad (2)$$

The sums here are sums of squared residuals, and under
the assumption that all plot-yields have zero mean and
variance $\sigma^2$, the mean values or expectations of the terms
give, by 59 (5), the relation

$$(hk-1)\sigma^2 = h(k-1)\sigma^2+(h-1)\sigma^2, \qquad (3)$$

where the terms correspond to those in (2). The first
term on the right follows from the fact that the mean
value or expectation of the sum of the $k$ squared deviations
for any row is $(k-1)\sigma^2$; and the second term then follows
by subtraction.

The coefficients in (3) are really degrees of freedom;
and we thus distinguish $hk-1$ degrees of freedom for all
$hk$ plots, of which $h-1$ are for variation between means
of rows, that is, between varieties, and $h(k-1)$ are for
variation about the particular variety means $y_{i0}$, that is,
within varieties.

If the hypothesis to be tested is that varieties are not
essentially different in yield, this is the same as to suppose
that variation between varieties is subject to the same
cause as variation within varieties, that is, to ordinary
randomness arising from soil heterogeneity and other
causes common to all plots. The test is therefore to compute
an $s^2$ from the sum of squares between varieties
and an $S^2$ from the sum of squares within varieties, these
being independent estimates of $\sigma^2$, and to see from the
$z$-table whether they are compatible. In the calculation
of $s^2$ and $S^2$ the respective degrees of freedom should be
used as divisors; and $S^2$ is most easily calculated by
means of

$$h(k-1)S^2 = \sum_i\sum_j y_{ij}^2 - k\sum_i y_{i0}^2. \qquad (4)$$

76. Analysis into Two Sources of Variation and
Residual. Next, still with the same $h$-by-$k$ arrangement
(which in the random placing of plots in the field is
called the “randomized block” arrangement), let the
rectangle of $h$ rows and $k$ columns of yields be set out
for analysis in the case when there are not only $h$ different
varieties, but each is subjected to $k$ different treatments,
so that $y_{ij}$ is the yield of the $i$th variety under the $j$th
treatment. Let the means of columns (treatments) be
$y_{01}, y_{02}, ..., y_{0k}$.

Consider the term $\sum_i\sum_j (y_{ij}-y_{i0})^2$ in 75 (2), and imagine
all the deviations $y_{ij}-y_{i0}$ from mean of variety to be
set out in a rectangle just as the $y_{ij}$ were. Since
$\sum_i y_{i0} = \sum_i\sum_j y_{ij}/k = 0$, the means of the $y_{ij}-y_{i0}$ in columns
are merely those of the $y_{ij}$ themselves, namely $y_{01}, y_{02}, ..., y_{0k}$.

Hence, by analysing this term exactly as $\sum_i\sum_j y_{ij}^2$ was,
but with respect to column means instead of row means,
we have

$$\sum_i\sum_j (y_{ij}-y_{i0})^2 = \sum_i\sum_j (y_{ij}-y_{i0}-y_{0j})^2 + h\sum_j y_{0j}^2. \qquad (1)$$

Hence again, from 75 (2),

$$\sum_i\sum_j y_{ij}^2 = \sum_i\sum_j (y_{ij}-y_{i0}-y_{0j})^2 + k\sum_i y_{i0}^2 + h\sum_j y_{0j}^2, \qquad (2)$$

which exhibits a threefold dissection, the last two terms
on the right corresponding to variation between varieties
and between treatments respectively, and the first term
to residual variation. As for degrees of freedom, by taking
expectations as before of these sums of squared residuals,
we have

$$(hk-1)\sigma^2 = (h-1)(k-1)\sigma^2+(h-1)\sigma^2+(k-1)\sigma^2, \qquad (3)$$

the coefficients giving the desired divisors of corresponding
terms on the right of (2), for estimates of variance. The
comparison of estimates by the $z$-table is then available.


77. The Latin Square. In the arrangement shown
below, each of A, B, C, D, E appears exactly five times
in rows and columns of a square, but no letter occurs
twice in the same row or same column. Such an
arrangement of $h$ letters each repeated $h$ times is called
a Latin square of order $h$.

    A  B  C  D  E
    E  C  A  B  D
    B  D  E  C  A
    D  E  B  A  C
    C  A  D  E  B

Imagine the Latin square to be a scheme of plot-yields
set up for analysis, the letters representing yields of
different varieties, the rows corresponding to varied
treatments of one kind, the columns to varied treatments
of another kind; for example, two kinds of fertilizer
applied at once, at $h$ different levels of strength in each.
There are thus three dimensions of variation, two for
treatments and one for variety; and so the yields may
be written $y_{ijl}$, where $i$ refers to row, $j$ to column, $l$ to
variety. Let the respective means for rows, columns and
varieties be $y_{i00}$, $y_{0j0}$ and $y_{00l}$. Each suffix runs from 1 to $h$.
A first analysis as in 76 (2) gives us

$$\sum_i\sum_j y_{ijl}^2 = \sum_i\sum_j (y_{ijl}-y_{i00}-y_{0j0})^2 + h\sum_i y_{i00}^2 + h\sum_j y_{0j0}^2. \qquad (1)$$

But now arrange the $y_{ijl}-y_{i00}-y_{0j0}$ in rows, say, according
to $l = 1, 2, ..., h$, and analyse once again. Since the
means of $y_{i00}$ and $y_{0j0}$ are zero, the means of
$y_{ijl}-y_{i00}-y_{0j0}$ are simply $y_{00l}$, where $l = 1, 2, ..., h$.
We therefore have

$$\sum_i\sum_j y_{ijl}^2 = \sum_i\sum_j (y_{ijl}-y_{i00}-y_{0j0}-y_{00l})^2 + h\sum_i y_{i00}^2 + h\sum_j y_{0j0}^2 + h\sum_l y_{00l}^2, \qquad (2)$$

where the three last terms on the right are sums of squares
for variation between rows, columns and varieties respectively,
and the first term is for residual variation. By
taking the expectations of these sums of squared residuals
we have

$$(h^2-1)\sigma^2 = (h-1)(h-2)\sigma^2+(h-1)\sigma^2+(h-1)\sigma^2+(h-1)\sigma^2, \qquad (3)$$

which shows the respective degrees of freedom to be used
as divisors in the estimates of variance.

Example. The entries in the square below are the numbers
of successes in 25 sets of 10 drawings with probability $p = 0.52$,
written down consecutively in 5 rows. The mean squares of
the analyses may be compared with the theoretical $\sigma^2$, which
is $10 \times 0.52 \times 0.48 = 2.50$.

The working details, based on the formulae of 75, 76, 77,
are shown (i) in ordinary row and column analysis, applicable
equally to $h$ rows and $k$ columns, (ii) in Latin square analysis,
using the particular Latin square (q.v.) given above.


(i)                                    Sums.   Means.
            5     3     2     4     6     20     4.0
            6     7     5     4     5     27     5.4
            3     6     3     6     5     23     4.6
            8     3     7     6     4     28     5.6
            6     2     6     4     5     23     4.6
   Sums    28    21    23    24    25    121
   Means  5.6   4.2   4.6   4.8   5.0           4.84

(ii)  Latin sq.   Sums.   Means.
         A          23     4.6
         B          22     4.4
         C          25     5.0
         D          29     5.8
         E          22     4.4
       Total       121     4.84

The various sums of squares used are: (1) the sum of
squares of all 25 entries, namely 647; (2) the sum of the five
products, row-sum by row-mean, 594.2; (3) the same for
columns, 591.0; (4) the same for letters in Latin square,
592.6. Each one of these must be corrected for transference
to the general mean 4.84, and the correction in every case is
to subtract the product of total sum by total mean, 121 by
4.84, or 585.64.

Thus the corrected sums of squares are (1) 61.36, (2) 8.56,
(3) 5.36, (4) 6.96. The residual sum of squares is found by
subtraction from the total sum, and the details of estimate of
mean square are set out in tabular form thus:

(i) Row and column analysis.

              Sum sq.   Degr.   Mean sq.
   Rows         8.56       4      2.14
   Cols.        5.36       4      1.34
   Res.        47.44      16      2.97
   Total       61.36      24      2.56

(ii) Latin square analysis.

              Sum sq.   Degr.   Mean sq.
   Rows         8.56       4      2.14
   Cols.        5.36       4      1.34
   Letters      6.96       4      1.74
   Res.        40.48      12      3.37
   Total       61.36      24      2.56


We need not continue, but in practice the mean squares
for rows, columns and letters would be compared with each
other and with the residual mean square by taking half the
difference of logarithms and applying the $z$-test. (The
logarithms are Napierian, and we may note the relation
$\frac{1}{2}\log_e u = 1.151 \log_{10} u$.)
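
The sums of squares of this Example can be recomputed directly from the 25 entries and the Latin square of 77. The sketch below follows the procedure described in the text (group sum by group mean, less the correction 121 by 4.84); the variable and function names are illustrative only.

```python
square = [[5, 3, 2, 4, 6],
          [6, 7, 5, 4, 5],
          [3, 6, 3, 6, 5],
          [8, 3, 7, 6, 4],
          [6, 2, 6, 4, 5]]
letters = ["ABCDE", "ECABD", "BDECA", "DEBAC", "CADEB"]   # Latin square of 77
h = len(square)

total = sum(sum(row) for row in square)
correction = total * total / (h * h)          # total sum times general mean

def corrected(sums):
    """Sum of (group sum x group mean) over groups of h entries, less the correction."""
    return sum(s * s / h for s in sums) - correction

row_sums = [sum(row) for row in square]
col_sums = [sum(row[j] for row in square) for j in range(h)]
letter_sums = [sum(square[i][j] for i in range(h) for j in range(h)
                   if letters[i][j] == name) for name in "ABCDE"]

ss_total = sum(v * v for row in square for v in row) - correction
ss_rows, ss_cols, ss_letters = map(corrected, (row_sums, col_sums, letter_sums))
res_i = ss_total - ss_rows - ss_cols          # residual, analysis (i), 16 degrees
res_ii = res_i - ss_letters                   # residual, analysis (ii), 12 degrees

print(round(ss_total, 2), round(ss_rows, 2), round(ss_cols, 2), round(ss_letters, 2))
print(round(res_i / (h - 1) ** 2, 3), round(res_ii / ((h - 1) * (h - 2)), 3))
```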

The principle of isolation, by appropriate experimental 
design, of the separate variations due to several simul¬ 
taneous causes, has been developed and widely applied 
in recent years. Complex patterns, such as randomized 
blocks in which each element is itself a block, or Latin 
squares in which each “ letter ” is a Latin square, have 
been designed and used. The idea is to save time, space 
and expense by being able to conduct several kinds of 
experiment at the same time and within the one frame. 
For further details the reader may consult Fisher’s The 
Design of Experiments , 2nd edition, or Yates’s The Design 
and Analysis of Factorial Experiments (Harpenden, 1937). 


78. Conclusion. The consideration of other sampling
distributions would exceed our space and scope, but one
of special interest may be noted. The distribution of $r$,
the standardized product-moment estimate (without
Sheppard’s correction) of $\rho$ in normal correlation, was
found by R. A. Fisher in 1915. The probability function
has the rather complicated form

$$\phi(r) = c\,(1-\rho^2)^{\frac{1}{2}(n-1)}(1-r^2)^{\frac{1}{2}(n-4)}\,\frac{d^{n-2}}{d(r\rho)^{n-2}}\left[\frac{\arccos(-r\rho)}{\sqrt{1-r^2\rho^2}}\right], \qquad (1)$$

and the curve, if $\rho$ is at all large and the sample small,
is skew and in cases even U-shaped. (The function and
its integral have been computed, for $n$ = 3, 4, ..., 25, 50,
100, 200, 400 and $\rho$ = 0.1, 0.2, 0.3, ..., 0.9, by F. N.
David in Tables of the Correlation Coefficient, London,
1938.)

It was proved by Fisher (Metron, 1921) that the
hyperbolic tangent transformation

$$z' = \tfrac{1}{2}\log_e\frac{1+r}{1-r}, \qquad \zeta = \tfrac{1}{2}\log_e\frac{1+\rho}{1-\rho}, \qquad (2)$$

produces a distribution which even for $n$ as small as 20
is nearly normal, with mean $\zeta$ and variance $1/(n-3)$.

A second transformation of $r$, namely

$$t = \frac{r}{\sqrt{1-r^2}}\,\sqrt{n-2}, \qquad (3)$$

leads, in the case $\rho = 0$, to a $t$-distribution with $n-2$
degrees of freedom.
These transformations are necessary, because of the
extreme non-normality of the sampling distribution of $r$,
which makes the crude use of the standard error of $r$ a
fallacious procedure.
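
Both transformations are simple to evaluate. The sketch below is illustrative only: the function names are assumptions, the sample values $r = 0.6$ and $n = 20$ are hypothetical, and the second function is meaningful as a $t$ only on the hypothesis of zero correlation in the population.

```python
import math

def z_transform(r):
    """Fisher's hyperbolic tangent transformation, (2) of 78."""
    return 0.5 * math.log((1 + r) / (1 - r))

def t_from_r(r, n):
    """The quantity (3) of 78, to be referred to the t-table with n - 2
    degrees of freedom on the hypothesis of zero correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical sample values, for illustration only.
r, n = 0.6, 20
z1 = z_transform(r)
standard_error = 1 / math.sqrt(n - 3)      # from the variance 1/(n - 3)
print(round(z1, 4), round(standard_error, 4), round(t_from_r(r, n), 3))
```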


79. Estimation of Parameters from Sample. In
40 and 42 we have estimated the mean $\mu$ of a normal and
a Poisson distribution by the mean $m$ of the sample; in
43 we have pointed out the demerits of the mean of sample
in estimating the true mean of a Cauchy distribution.
The general problem of estimation is this: given $n$ sample
values $x_1, x_2, ..., x_n$ of a variate $x$ with probability function
$\phi(x; \theta)$ involving a parameter $\theta$, what function
$T(x_1, x_2, ..., x_n)$ of the sample values shall be used to
estimate $\theta$?
The problem must be posed in mathematical terms, and
must, in order to become intelligible, assume a certain
degree of arbitrariness. One fruitful principle, well
justified by its results, consists in choosing $T$ by making
the compound probability density of $x_1, x_2, ..., x_n$ a maximum
with respect to $\theta$. This is R. A. Fisher’s principle of
maximum likelihood. Another principle postulates (i) that
$T$ shall be unbiassed, in the sense that the mean value of
$T$ over all samples of $n$ data shall be equal to $\theta$, (ii) that of
all such functions $T$ shall be the one with minimum sampling
variance. In many cases these two different approaches
(the second of which has not yet been deeply explored)
lead to the same function $T$ of estimate. The situation is
parallel to that which occurs in the theory of Least Squares,
where, as mentioned at the end of 57, different sets of
postulates lead to the same normal equations.
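
As a small illustration of the principle of maximum likelihood, the sketch below maximizes the compound probability of a sample from a Poisson distribution by a crude search over trial values of the parameter, and compares the result with the mean of the sample. The counts are hypothetical, the grid search is an assumed device chosen for transparency rather than efficiency, and the names are illustrative.

```python
import math

def poisson_log_likelihood(theta, sample):
    """Logarithm of the compound probability of the sample for Poisson parameter theta."""
    return sum(-theta + x * math.log(theta) - math.lgamma(x + 1) for x in sample)

# Hypothetical counts, for illustration only.
sample = [2, 4, 3, 5, 1, 3, 4, 2]

# Trial values 0.001, 0.002, ..., 10.000 of the parameter.
grid = [k / 1000 for k in range(1, 10001)]
best = max(grid, key=lambda theta: poisson_log_likelihood(theta, sample))

print(best, sum(sample) / len(sample))
```

Working with the logarithm of the compound probability leaves the position of the maximum unchanged and avoids very small products.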
80. Four-Place Table of $\Phi(x) = (2\pi)^{-\frac{1}{2}} \int_0^x e^{-\frac{1}{2}x^2}\,dx$.

   x      Φ         x      Φ         x      Φ         x      Φ

  0.00   0000      0.50   1915      1.00   3413      1.50   4332
  0.02   0080      0.52   1985      1.02   3461      1.55   4394
  0.04   0160      0.54   2054      1.04   3508      1.60   4452
  0.06   0239      0.56   2123      1.06   3554      1.65   4505
  0.08   0319      0.58   2190      1.08   3599      1.70   4554
  0.10   0398      0.60   2257      1.10   3643      1.75   4599
  0.12   0478      0.62   2324      1.12   3686      1.80   4641
  0.14   0557      0.64   2389      1.14   3729      1.85   4678
  0.16   0636      0.66   2454      1.16   3770      1.90   4713
  0.18   0714      0.68   2517      1.18   3810      1.95   4744
  0.20   0793      0.70   2580      1.20   3849      2.00   4772
  0.22   0871      0.72   2642      1.22   3888      2.10   4821
  0.24   0948      0.74   2703      1.24   3925      2.20   4861
  0.26   1026      0.76   2764      1.26   3962      2.30   4893
  0.28   1103      0.78   2823      1.28   3997      2.40   4918
  0.30   1179      0.80   2881      1.30   4032      2.50   4938
  0.32   1255      0.82   2939      1.32   4066      2.60   4953
  0.34   1331      0.84   2995      1.34   4099      2.70   4965
  0.36   1406      0.86   3051      1.36   4131      2.80   4974
  0.38   1480      0.88   3106      1.38   4162      2.90   4981
  0.40   1554      0.90   3159      1.40   4192      3.00   49865
  0.42   1628      0.92   3212      1.42   4222      3.20   49931
  0.44   1700      0.94   3264      1.44   4251      3.40   49966
  0.46   1772      0.96   3315      1.46   4279      3.60   49984
  0.48   1844      0.98   3365      1.48   4306      3.80   49993
  0.50   1915      1.00   3413      1.50   4332      4.00   49997


A decimal point is understood before each entry $\Phi(x)$;
and second difference interpolation is advisable in the last
column.

A useful inverse table of the normal probability integral
is the table of Probits in Fisher and Yates’s Statistical Tables
for Biological, Agricultural and Medical Research (Oliver and
Boyd, 1938), pp. 38-40. The “probit” is the value of $x$ which
cuts off at its ordinate a given percentage of area measured
from the left of the normal curve.
1. Finite Differences and Factorial Polynomials. Most
tables of functions provide us with sequences of values which
by a suitable choice of origin and scale may be denoted by
$u_0, u_1, u_2, ...$. To these we may apply differencing and repeated
differencing, analogous in the Calculus of Finite Differences
to differentiation in the Infinitesimal Calculus (see Whittaker
and Robinson, Calculus of Observations (Blackie), Chapters I
to IV). The operations most commonly used are:

the advancing difference, $\Delta u_x = u_{x+1}-u_x$,
the receding difference, $\nabla u_x = u_x-u_{x-1}$,
the central difference, $\delta u_x = u_{x+\frac{1}{2}}-u_{x-\frac{1}{2}}$,     (1)
the averaging operation, $\mu u_x = \frac{1}{2}(u_{x+\frac{1}{2}}+u_{x-\frac{1}{2}})$,
the mean central difference, $\mu\delta u_x = \frac{1}{2}(u_{x+1}-u_{x-1})$,

operations which may all be repeated. The classical formula
of interpolation, which uses advancing differences derived
from $u_0, u_1, u_2, ...$, is the Gregory-Newton formula

$$u_x = u_0+x\Delta u_0+x^{(2)}\Delta^2u_0/2!+x^{(3)}\Delta^3u_0/3!+..., \qquad (2)$$

a formula which terminates at $n+1$ terms if $u_x$ is a polynomial
of degree $n$, and which in practical cases converges well,
with negligible remainder, after a few terms. The formula
is the analogue of the Taylor series in the Infinitesimal Calculus.
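
The Gregory-Newton formula (2) may be sketched in a few lines. The function names `forward_differences` and `gregory_newton` are illustrative, and the tabulated cubic is a hypothetical example chosen so that the series terminates and the result can be checked exactly.

```python
import math

def forward_differences(values):
    """Leading advancing differences u_0, Δu_0, Δ²u_0, ... of a tabulated sequence."""
    leading, column = [], list(values)
    while column:
        leading.append(column[0])
        column = [b - a for a, b in zip(column, column[1:])]
    return leading

def gregory_newton(values, x):
    """Interpolate u_x from u_0, u_1, ... by formula (2)."""
    total, factorial_poly = 0.0, 1.0           # factorial_poly holds x^(r)
    for r, diff in enumerate(forward_differences(values)):
        total += factorial_poly * diff / math.factorial(r)
        factorial_poly *= (x - r)              # x^(r+1) = x^(r) (x - r)
    return total

# Hypothetical table: u_x = x^3 - 2x + 1 at x = 0, 1, 2, 3, 4.
table = [x ** 3 - 2 * x + 1 for x in range(5)]
print(gregory_newton(table, 2.5))
```

Since the tabulated function is a cubic, its fourth differences vanish and the formula reproduces the polynomial without remainder.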

The polynomials 1, $x$, $x^{(2)}$, $x^{(3)}$, ... which appear in (2)
are (29) ordinary factorial polynomials 1, $x$, $x(x-1)$,
$x(x-1)(x-2)$, .... If they are divided respectively by 0!,
1!, 2!, 3!, ... we obtain the reduced factorials or binomial
coefficients 1, $x$, $x_{(2)}$, $x_{(3)}$, ....

Central factorials may be defined by

1, $x^{\{1\}} = x$, $x^{\{2\}} = (x+\frac{1}{2})(x-\frac{1}{2})$, $x^{\{3\}} = (x+1)x(x-1)$, ..., (3)

the factors being in arithmetical progression of common
difference unity and centred at $x$. The reader may verify that

$\mu x^{\{1\}} = x$, $\mu x^{\{2\}} = x^2$, $\mu x^{\{3\}} = x(x^2-\frac{1}{4})$, $\mu x^{\{4\}} = x^2(x^2-1)$, .... (4)

Given an odd number of values of $u_x$ with central value
$u_0$, the Newton-Stirling formula of interpolation is useful,

$$u_x = u_0+x\,\mu\delta u_0+\mu x^{\{2\}}\,\delta^2u_0/2!+x^{\{3\}}\,\mu\delta^3u_0/3!+.... \qquad (5)$$

Given an even number of values with two central values
$u_{-\frac{1}{2}}$ and $u_{\frac{1}{2}}$, the Newton-Bessel formula is the appropriate one,

$$u_x = \mu u_0+\mu x^{\{1\}}\,\delta u_0+x^{\{2\}}\,\mu\delta^2u_0/2!+\mu x^{\{3\}}\,\delta^3u_0/3!+.... \qquad (6)$$

These formulas use central and mean central factorials, and
mean central and central differences, alternately.

The origin of interpolation $x = 0$ can almost always be
chosen so that $|x|$ in the interpoland $u_x$ need not exceed $\frac{1}{2}$.

The following relations are fundamental:

$\Delta x^{(r)} = rx^{(r-1)}$, equivalent to $\Delta x_{(r)} = x_{(r-1)}$; $\delta x^{\{r\}} = rx^{\{r-1\}}$.

(Cf. $Dx^r = rx^{r-1}$ in the Differential Calculus.)

2. Finite Sums. The following table of repeated
summation upon $u_0, u_1, ..., u_{n-1}$, exemplified for $n = 5$,
follows the scheme proposed in 19 for computing factorial
moments.

   u      Σ                       Σ²                    Σ³              Σ⁴

   u_0    u_0+u_1+u_2+u_3+u_4
   u_1    u_1+u_2+u_3+u_4         u_1+2u_2+3u_3+4u_4
   u_2    u_2+u_3+u_4             u_2+2u_3+3u_4         u_2+3u_3+6u_4
   u_3    u_3+u_4                 u_3+2u_4              u_3+3u_4        u_3+4u_4
   u_4    u_4                     u_4                   u_4             u_4

Scrutiny will show that the entries at the tops of the
successive columns of summation are the reduced factorial
moments:

$\sum u_x$, $\sum xu_x$, $\sum x_{(2)}u_x$, $\sum x_{(3)}u_x$, $\sum x_{(4)}u_x$, ....

This may be proved by an induction based on $\Delta x_{(r)} = x_{(r-1)}$.

With a little more difficulty, using central factorials, it
may be proved that the scheme of repeated summation
toward the centre with alternate averaging, used in Ex. 3
of 19, produces reduced central and mean central factorial
moments $\sum x^{\{r\}}u_x/r!$ and $\sum \mu x^{\{r\}}u_x/r!$.
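
The scheme of repeated summation lends itself to a short mechanical check. In the sketch below each column is formed by cumulative summation from the bottom of the preceding column, the top entry being recorded and then dropped; the function name `repeated_summation` and the frequencies are hypothetical, and the tops are compared with the reduced factorial moments computed directly from binomial coefficients.

```python
import math

def repeated_summation(u, columns):
    """Tops of the successive columns of summation for the sequence u_0, u_1, ...

    columns must not exceed the number of values in u.
    """
    tops, column = [], list(u)
    for _ in range(columns):
        # Cumulative sums taken from the bottom of the column upward.
        running, sums = 0, []
        for value in reversed(column):
            running += value
            sums.append(running)
        sums.reverse()
        tops.append(sums[0])
        column = sums[1:]          # the next column starts one row lower
    return tops

# Hypothetical frequencies u_0, ..., u_4, for illustration only.
u = [3, 7, 9, 4, 2]
tops = repeated_summation(u, 4)
direct = [sum(math.comb(x, r) * ux for x, ux in enumerate(u)) for r in range(4)]
print(tops, direct)
```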
3. Relations between Powers and Factorials. We have

   (i)  $x = x$,                                       (ii)  $x = x$,
        $x^2 = x^{(2)}+x$,                                   $x^2 = \mu x^{\{2\}}$,
        $x^3 = x^{(3)}+3x^{(2)}+x$,                          $x^3 = x^{\{3\}}+x$,
        $x^4 = x^{(4)}+6x^{(3)}+7x^{(2)}+x$,                 $x^4 = \mu x^{\{4\}}+\mu x^{\{2\}}$,

as may be verified by actual expansion. Multiplying any of
these relations by $u_x$ and summing over equally spaced values
of $x$, (i) with $x = 0$ as least value, (ii) with $x = 0$ as middle
value, we derive the relations quoted and used in 19, Exs. 1
and 3, for converting factorial moments, or central and mean
central factorial moments, into ordinary moments.

4. Tables of Normal Probability Integral and Poisson
Function. A very convenient table of the normal probability
integral in standard scale, to four places of decimals, is given
in Bowley’s Elements of Statistics, p. 271. The table is
accurate enough for most practical purposes, and may be
interpolated by proportional parts, that is, only using first
differences. We give a compact table in 80, p. 144.

In the Poisson function the chief requirement is the value
of $e^{-m}$. If a machine is available, the following short table
enables $e^{-m}$ to be computed with sufficient accuracy for
$m$ = 0 to 10.

    m    e^{-m}         m     e^{-m}      m      e^{-m}      m       e^{-m}

    1    0.36788       0.1    0.90484    0.01    0.99005    0.001    0.99900
    2    0.13534       0.2    0.81873    0.02    0.98020    0.002    0.99800
    3    0.049787      0.3    0.74082    0.03    0.97045    0.003    0.99700
    4    0.018316      0.4    0.67032    0.04    0.96079    0.004    0.99601
    5    0.0067379     0.5    0.60653    0.05    0.95123    0.005    0.99501
    6    0.0024788     0.6    0.54881    0.06    0.94176    0.006    0.99402
    7    0.0009119     0.7    0.49659    0.07    0.93239    0.007    0.99302
    8    0.0003355     0.8    0.44933    0.08    0.92312    0.008    0.99203
    9    0.0001234     0.9    0.40657    0.09    0.91393    0.009    0.99104
   10    0.0000454     1.0    0.36788    0.10    0.90484    0.010    0.99005

For smaller values of $m$ than those given above the approximation
$1-m$ for $e^{-m}$ is correct to at least five decimals.

Ex. In the example of 42 we have $m = 3.870$. Entering
the above table at $m$ = 3, 0.8, 0.07, we form the product,
thus: 0.049787 × 0.44933 × 0.93239 = 0.020858.
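
The product in this Ex. can be compared with a direct evaluation of the exponential. The sketch below copies the three table entries used above; the dictionary name `TABLE` is illustrative.

```python
import math

# The three entries of the short table used in the Ex., keyed by m.
TABLE = {3: 0.049787, 0.8: 0.44933, 0.07: 0.93239}

product = TABLE[3] * TABLE[0.8] * TABLE[0.07]    # e^{-3} e^{-0.8} e^{-0.07}
print(round(product, 6), round(math.exp(-3.87), 6))
```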

6. Linear Dependence, Functional Dependence, Correlation,
Statistical Dependence. These are concepts
which need careful discrimination. The functions $u_j(x)$,
where $j = 1, 2, ..., n$, are linearly dependent if a relation

$$c_1u_1+c_2u_2+...+c_nu_n = 0, \qquad (1)$$

exists identically in $x$, where one at least of the $c_j$ is not
zero. They are functionally dependent if a functional relation

$$F(u_1, u_2, ..., u_n) = 0 \qquad (2)$$

exists identically in $x$. Linear dependence, for example, is
the case where $F$ is a non-zero linear function. They are
uncorrelated if the product moment $\mu_{11}$ vanishes for each pair
$u_i$ and $u_j$ of the set.

Correlation and functional dependence are (48) not
necessarily the same. The simplest example is perhaps
$u = a\cos x+b\sin x$, $v = a\sin x-b\cos x$. Here $u$ and $v$ are
uncorrelated, yet are dependent in view of the quadratic
relation $u^2+v^2-a^2-b^2 = 0$.
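
The example of $u$ and $v$ may be checked numerically. The sketch below assumes, as the text does not state explicitly, that $x$ is spread uniformly over one complete period; the constants $a$ and $b$ and the number of points are hypothetical.

```python
import math

a, b = 2.0, 1.0            # hypothetical constants, for illustration only
points = 3600              # x spread uniformly over one period, 0 to 2*pi (assumed)
xs = [2 * math.pi * k / points for k in range(points)]

u = [a * math.cos(x) + b * math.sin(x) for x in xs]
v = [a * math.sin(x) - b * math.cos(x) for x in xs]

mean_u, mean_v = sum(u) / points, sum(v) / points
product_moment = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v)) / points
largest_gap = max(abs(p * p + q * q - a * a - b * b) for p, q in zip(u, v))

# The product moment vanishes, yet the quadratic relation holds at every point.
print(abs(product_moment) < 1e-9, largest_gap < 1e-9)
```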

To describe statistical dependence, we may say that
statistical independence is really obedience to the multiplication
theorem of probability. Suppose we have two functions
of $n$ variates, $u(x, y, z)$ and $v(x, y, z)$, where we illustrate by
$n = 3$. They have each a probability, or probability density,
let us say $\psi_1(u)$ and $\psi_2(v)$, depending on the distribution of
$x$, $y$ and $z$. They have also a compound probability, or
probability density, let us say $\psi(u, v)$. If for all the possible
values of $x$, $y$, $z$ we have $\psi(u, v) = \psi_1(u)\psi_2(v)$, then we say
that $u$ and $v$ are statistically independent.

An equivalent formulation is by generating functions. If

$$G(\alpha, \beta) = \int\!\!\int\!\!\int e^{\alpha u+\beta v}\,\phi(x, y, z)\,dx\,dy\,dz, \qquad (3)$$

where $\phi(x, y, z)$ is the compound probability density of $x$, $y$, $z$,
and if

$$G(\alpha, \beta) = G(\alpha, 0)\,G(0, \beta), \qquad (4)$$

all integrals existing in some common domain of $\alpha$, $\beta$, then
$u$ and $v$ are statistically independent. By this criterion it
may be proved that the estimates $m$ of $\mu$ and $s^2$ of $\sigma^2$ in a
normal sample (72) of $n$ values $x_i$ are statistically independent,
so that the derivation of the $t$-distribution (loc. cit.) is valid.
of Poisson, 50, 51, 58
Binomial correlation, 82, 83
Bivariate distribution, 80, 86,
94, 95
Change of origin and scale, 23, 60
Complementary event, 13
Compound probability, 14, 15
Contingency table, 82, 83, 99,
100, 102, 104, 105 
Corrections, Sheppard’s, 39, 44- 
47, 73, 76, 94, 142 
Correlation, 80-103, 111, 112, 
113, 114, 148 
binomial, 82, 83 
coefficient of, 86, 87, 90-93, 
112-114 

hypergeometric, 84 
non-linear, 95-98 
non-metrical, 99-105 
partial, 113, 114 
Poissonian, 94, 95 
ratio, 95-99 
surface, 82, 88 
table, 89-93 
total, 112, 113, 114 
Covariance, 84 

Criteria of homogeneity, 54, 55 
Cumulant, 22, 23, 32, 61, 64, 
65, 70, 128 
factorial, 23, 64 
generating function, 22, 64, 
65, 67, 70 

Degrees of freedom, 102, 103, 
123, 130, 131, 133, 134, 
137-140, 143 

Density, probability, 16, 143 
Dependence, linear, 101, 102, 
103, 130, 148 
functional, 88, 148 
statistical, 13, 14, 15, 87, 88, 
102, 130, 148 

Dependent events, 15, 16, 56 
Deviation, mean absolute, 32 
standard, 35, 37 
Difference of means, 134 
Differences, finite, 59, 67, 115, 
118, 119, 145, 146 
Dispersion, 32, 34, 35 
residual, 96 
Distribution— 

binomial, 49, 58, 125, 133 
binomial of Poisson, 50, 51, 58 
bivariate, 80, 86, 94, 95 
Coolidge, 53-55 
frequency, 26 

Gamma type, 69, 72, 102, 128
Helmert’s, 130


Distribution— 

hypergeometric, 56, 57 
J-shaped, 27, 64 
leptokurtic, 38 
Lexian, 53, 54, 55, 72 
multinomial, 56, 101 
multivariate, 80, 101 
normal, 58-62, 72, 73, 74 
normal correlated, 86-89, 101, 
113 

of Fisher’s z, 135, 136, 138 
of r, 92, 142, 143 
of Student’s t , 131-134, 143, 
148 

of sum of squares, 69, 129 
of $\chi^2$, 101, 102

of variance estimate, 130-131 
Pearsonian system, 67-71 
platykurtic, 38, 71, 127, 133
Poisson, 58, 59, 63, 64, 66, 
77, 78, 103 

Poisson correlated, 94, 95 
probability, 26 
rectangular, 48, 79 
sampling, 92, 125-127, 133 
skew, 27, 31, 36, 37, 58, 69, 
72, 127, 131 
symmetrical, 27, 61 
trivariate, 80, 112, 113 
Type A, 58, 59, 64, 65, 66, 
67, 73, 75, 76 

Type B, 58, 59, 66, 67, 73,
76, 77, 78 
Type I, 71, 72 

Type III, 69, 72, 102, 128, 129, 
130, 131 

U-shaped, 27, 69, 142 
Dot diagram, 80 


Ellipse, probable, 88 
Empirical formula for $P(\chi^2)$,
104 

Equal likeliness, 10, 11 
Equations, normal, 110, 114, 
115, 116, 121 

Error function, 62, 73, 74, 76, 
144, 147 

Error of mean, 128 
of moments, 39, 126 
of r, 92, 143 
Error of sampling, 39, 92, 126-
129, 131, 143 

of variance estimate, 129-131 
probable, 35 

standard, 39, 92, 126-129, 131, 
143 

Errors and residuals, 107, 108 
Estimation from sample, 78, 79, 
143 

Euler-Maclaurin formula, 44 
Events, 5, 6 

dependent, 13, 15 
independent, 13, 14, 15, 18 
mutually exclusive, 13, 18 
Excess, 38, 63, 131 
Expectation, mathematical, 21 

Factorial moments, 21, 22, 41, 
49, 63, 84, 85, 146, 147 
moment generating function, 
22, 49, 63, 84, 85 
polynomials, 21, 145, 146 
cumulants, 23, 64 
cumulant generating function, 
64 

Factorials and powers, 147 
Factorials, central, 145, 146 
Finite differences, 59, 67, 115, 
118, 119, 145, 146 
sums, 40-43, 146 
Fitting of harmonic function, 
120-123 

of polynomial, 114-120 
of probability curves, 73-78, 
143 

Formulae of interpolation, 145, 
146 

Fourfold table, 82, 85, 94, 95
Fourier transform, 22
Frequency, marginal, 82, 100, 
102 

relative, 4, 5, 7, 26, 60 
Frequency polygon, 28 
Function, probability, see Distribution

Functional dependence, 88, 148 

Gamma Type, see Distribution 
Generating function, 15, 16, 17, 
19, 148 

bivariate, 83, 84 


Generating function, change of 
origin and scale in, 23 
factorial moment, 22, 49, 63, 84, 85

factorial cumulant, 64 
moment, 20-24, 60, 65, 69, 
75, 84, 85, 86, 101, 113 
multiplication theorem, 19, 22 
cumulant, 22, 64, 65, 67, 70 
Goodness of fit, 76, 78, 100-103, 
104 

Gregory-Newton formula, 145 

Harmonic regression, 81, 120-123 
Helmert’s distribution, 130
Histogram, 28 

Homogeneity, criteria of, 54, 55 
Hypergeometric correlation, 84 
distribution, 56, 57 

Independence, functional, 88, 
148 

linear, 101, 148 
statistical, 87, 148 
Independent events, 13, 14, 15, 
18 

frequencies, 102, 103 
Inductive synthesis, 3 
Integral, probability, 62, 72, 73, 
133, 144, 147 

Interpolation formulae, 145, 146

J-shaped curve, 27, 64 

Kurtosis, 38, 63, 131, 133 

Latin square, 139-142 
Least squares, 95, 106-108, 110, 
111 

Leptokurtic, 38 
Lexian ratio, 55 
Lexian variance, 53, 54, 72 
Likelihood, maximum, 143 
Likeliness, equal, 10, 11 
Limit of relative frequency, 7, 8 
Limits of r and p, 87 
Linear dependence, 101, 102, 
103, 130, 148 

Linear function, of moments, 
36, 37 
Linear regression, 88, 89, 106,
112 

Logic, algebra of, 6 

Maximum likelihood, 143 
Mean absolute deviation, 32 
Mean central factorial, 145 
Mean, median and mode, 30, 31, 
37 

Mean square contingency, 104-105

Mean square, distribution of, 129 
Measure of aggregate, 10 
Median, 30, 32-34 
Moments, 20, 31, 37, 38 
computation of, 39-43 
see Factorial moments, Generating functions

Minimum variance, principle of, 
143 

Multinomial distribution, 55, 101 
Multiplication theorem, 14, 15, 
19 

Mutually exclusive events, 13, 18 

Non-linear regression, 95, 114 
Non-metrical correlation, 99-105 
Normal curve, see Distribution 
Normal equations, 110, 114-116, 
121 

Optimal values, 108, 109 
Origin, change of, 23, 29 
Orthogonal functions, 116, 120, 
124 

polynomials, 116, 116, 120, 
124 

Parameters, estimate of, 78, 79, 
143 

Partial correlation, 113, 114 
Pearson curves, 67-71 
Pearsonian coefficient r, 86, 87, 
90-93 

Periodic regression, 81, 120-123 
Perturbation, coefficient of, 55 
Phase aggregate, 10, 12, 14 
Platykurtic, 38, 71, 127, 133 
Poisson binomial, see Distribution


Polynomial, factorial, 21, 145, 146

Polynomial regression, 81, 114, 
116-120 

Population, 24, 25 
Powers and factorials, 147 
Precision, 107, 108 
Preparation of normal equations, 
110 

Prismogram, 81 
Probability, 4-12 
à priori, 6, 9

as limit of relative frequency, 
7, 8 

as measure of sub-aggregate, 10, 12

complementary, 13 
continuous, 12 
curve, 27 
definition, 6, 9, 12 
density, 16 

distribution, see Distribution 
function, 16 

fundamental theorems, 13, 14, 
15 

integral, 62, 72, 73, 133, 144, 147

marginal, 82, 100, 102 
of dependent events, 15 
parameters, 29, 30 
polygon, 28 
total, 13, 82 
Probable error, 35 
Product-moment, 84, 86, 87, 90 
Provisional mean, 39 

Quartiles, 34, 35, 37 

Randomized blocks, 138, 140, 
142 

Randomness, 9 
Range, 35, 36 
Ratio, correlation, 95-99 
Lexian, 55
of χ² variates, 135
Student’s, 131, 134, 143, 148 
Rectangular distribution, 48, 79 
Regression, 80-82, 88, 89, 96, 
106, 112-124 
coefficients, 112 
lines and planes, 88-89, 106, 112
Relative frequency, 4, 5, 7, 26,
60 

Replacement, sampling without, 56

Residual, 108 
dispersion, 96 
variance, 109, 119, 123 


Sample, 24, 25, 86, 107, 125-135 
estimation from, 78, 79, 143 
Sampling distribution, 92, 125-135, 143, 148

error, 39, 92, 125-135, 143 
of r, 92, 143 

Sampling without replacement, 
56 

Science, pure and applied, 2, 3 
Seminvariants, 22, 23 
Semi-interquartile range, 35
Series of Type A, B, see Distribution

Sheppard’s corrections, 39, 44- 
47, 73, 76, 94, 142 
Skewness, 27, 31, 36, 37, 58, 
69, 72, 127, 131 
Square, Latin, 139-142 
Standard deviation, 35, 37 
error, see Error of sampling 
Statistical dependence, 143 
Statistics, definition, 1, 5, 7, 
12 

Student’s t, see Distribution
Sum of squares, analysis of, 54, 
136-140

distribution of, 69, 129 
Summation method for moments, 
40-43, 146 
Symmetry, 27 
Synthesis, inductive, 3 


Tables, British Association, 75 
contingency, 82, 83, 99, 100, 
102, 104, 105 
correlation, 89-93 
fourfold, 82, 85, 94, 95 
of Fisher’s z, 136
of Poisson function, 147 
of P(χ²), 103, 105
of probability integral, 73, 
144, 147 

of Student’s t, 133, 134 
of terms in Type A, 75 
Tabulation, 1, 2 

Tchebychef polynomials, 115, 
117, 119, 121, 124 
Transform, Fourier, 22 
Trivariate problem, 80, 113, 114 

Universal, universe, 24 
U-shaped curve, 27, 69, 142 

Variance, 35 

analysis of, 54, 136-140 
Bernoullian, Poissonian and 
Lexian, 51-55, 72 
distribution of estimate of, 
130-131 

minimum, principle of, 143 
of linear function, 36 
of optimal value, 109 
of residuals, 109, 119, 123 
Variate, 16 

additive, 19, 64 
change of, 69, 135, 136 

Weight, 107, 108 

of arithmetic mean, 109 
Weighted mean, 109 

z-distribution, 135, 136